6 min read
|
Saved February 14, 2026
|
Copied!
Do you care about this?
The author reflects on their experience during the recent Cloudflare outage, highlighting how system limits and complex failures can lead to unexpected problems. They emphasize the importance of understanding the context behind decisions made during incidents and the value of detailed incident writeups for learning.
If you do, here's more
At QCon SF, the author reflected on the recent Cloudflare outage while hosting a session. The outage stemmed from a software limit on the feature file size, which was exceeded, causing a failure. The discussion centers on the concept of saturation in complex systems—how every system has limits that, when breached, lead to unexpected behaviors. For instance, in Cloudflare's case, their Bot Management system had a hard limit of 200 machine learning features, far above their typical usage of around 60. However, an explicit limit set on the system led to a panic due to an unhandled error when the system tried to process an oversized file.
The writeup also highlights the confusion during the incident. Operators initially suspected a DDoS attack due to sporadic error spikes and the unusual recovery pattern of the system. Their status page, crucial for communication during outages, went down around the same time, further clouding the diagnosis. This coincidence misled the team into thinking they were under external attack. The author emphasizes that even experienced operators can struggle with complex failure modes that don’t align with previous experiences, especially when symptoms fluctuate unpredictably.
Another key point is the importance of understanding the context behind coding decisions. The author urges against jumping to conclusions about programmer negligence without grasping the rationale behind their choices. Assuming local rationality helps in understanding how incidents occur and prevents a disconnect that limits learning. Finally, the detailed public writeup from Cloudflare, which included a code snippet and was signed by the CEO, stands out as an example of transparency. Such depth in incident documentation is rare and demonstrates a commitment to learning from failures rather than hiding from them.
Questions about this article
No questions yet.