Links
The article discusses error handling strategies in software systems, particularly in cloud environments. It emphasizes that error handling should be a global property of the system rather than a local one, considering factors like failure correlation, architecture capabilities, and the potential to continue operations. The author also highlights the importance of blast radius reduction techniques.
This article discusses strategies for building resilient distributed systems that can handle extreme traffic spikes, like those seen during major events. It highlights four key architectural patterns: load shedding, bulkhead isolation, request collapsing, and game day rehearsals, which together help systems withstand high demand without crashing.
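As a rough illustration of the first of those patterns, load shedding, here is a minimal sketch that rejects new work once a fixed number of requests are already in flight. The `LoadShedder` class, the `handle` helper, and the 503 response are hypothetical names chosen for this example; they are not taken from the linked article.

```python
import threading


class LoadShedder:
    """Hypothetical helper: admits at most `max_in_flight` concurrent requests."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # Admit the request only if there is spare capacity.
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self):
        # Free the slot once the request has finished.
        with self._lock:
            self._in_flight -= 1


def handle(shedder, work):
    """Run `work` if capacity allows; otherwise fail fast with a 503-style result."""
    if not shedder.try_acquire():
        return 503, "overloaded"
    try:
        return 200, work()
    finally:
        shedder.release()
```

Failing fast like this keeps excess requests from queueing up behind requests the system can still serve, which is the general idea behind shedding load rather than accepting everything.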
Klaviyo has developed a resilient event publisher using a dual failure capture design to ensure that no incoming events are lost during processing, even amidst network issues or serialization errors. By integrating Kafka topics and S3 for backup, the system can handle failures and maintain real-time event publishing for its customers. The design is reported to work well in production, where a significant number of events are automatically retried and recovered.
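The general shape of a publisher with layered failure capture can be sketched as follows. This is only an illustration under stated assumptions, not Klaviyo's implementation: `publish_with_fallback` and its three callables are hypothetical stand-ins for a primary Kafka topic, a retry topic, and an S3-style backup store, and no real Kafka or S3 client is used.

```python
import json


def publish_with_fallback(event, publish_primary, publish_retry, store_backup):
    """Try each sink in order and return the name of the one that took the event.

    All three callables are hypothetical; each accepts one string payload.
    """
    try:
        payload = json.dumps(event)
    except (TypeError, ValueError):
        # Serialization failed: keep a best-effort text form so nothing is dropped.
        store_backup(repr(event))
        return "backup"

    for name, sink in (("primary", publish_primary), ("retry", publish_retry)):
        try:
            sink(payload)
            return name
        except Exception:
            # Assumption: any exception from a sink means the publish failed.
            continue

    # Both publish paths failed, so persist the payload for later recovery.
    store_backup(payload)
    return "backup"
```

Passing the sinks in as plain callables keeps the sketch independent of any client library. A real system would also need a process that replays the backed-up events.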