Quit Emailing Yourself

# resilience → system-design

3 links tagged with all of: resilience + system-design

Links

What Now? Handling Errors in Large Systems

The article discusses error handling strategies in software systems, particularly in cloud environments. It emphasizes that error handling should be a global property of the system rather than a local one, considering factors like failure correlation, architecture capabilities, and the potential to continue operations. The author also highlights the importance of blast radius reduction techniques.

Saved by tldr-importer · Last saved February 14, 2026 · 3 min read

Tags: error-handling, system-design, cloud-computing, rust, resilience

The ‘Super Bowl’ standard: Architecting distributed systems for massive concurrency

This article discusses strategies for building resilient distributed systems that can handle extreme traffic spikes, like those seen during major events. It highlights four key practices: load shedding, isolation through bulkheads, request collapsing, and game day rehearsals that verify the system can withstand high demand without crashing.

Saved by tldr-importer · Last saved February 14, 2026 · 5 min read

Tags: load-shedding, system-design, concurrency, resilience, stress-testing

Building a Resilient Event Publisher with Dual Failure Capture

Klaviyo has developed a resilient event publisher using a dual failure capture design to ensure that no incoming events are lost during processing, even amidst network issues or serialization errors. By integrating Kafka topics and S3 for backup, the system can efficiently handle failures and maintain real-time event publishing for its customers. The implementation has proven effective, with significant automatic retries and event recovery in production.

Saved by tldr-importer · Last saved October 29, 2025 · 6 min read

Tags: event-publishing, resilience, kafka, data-recovery, system-design