Saved February 14, 2026
Do you care about this?
AWS suffered a major outage on October 19-20, caused by a race condition in DynamoDB's DNS management, that disrupted multiple services in the Northern Virginia region. Although AWS's incident report focuses on shorter, specific time frames, some customers experienced issues for up to 15 hours, prompting discussion of AWS's reliability and planned improvements.
If you do, here's more
On October 19 and 20, AWS experienced a significant outage that began with a failure in Amazon DynamoDB and affected a wide range of services in the Northern Virginia (us-east-1) region. The root cause was a latent race condition in DynamoDB's automated DNS management system, which caused endpoint resolution to fail. Services that depend on DynamoDB were disrupted as a result, including EC2 instance launches, Lambda invocations, and Fargate tasks. Some customers saw problems lasting up to 15 hours, even though the incident report focuses on shorter, specific time frames.
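To make this kind of failure concrete, here is a minimal Python sketch of one generic guard against applying a stale plan. It is an illustration only, not AWS's implementation: `DnsPlan`, `EndpointRecord`, and `apply_plan` are hypothetical names, and the guard shown (rejecting any plan whose generation is not newer than the applied one, with the check and the write done under a single lock) is a common technique rather than the fix AWS has described.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsPlan:
    """Hypothetical DNS plan: a generation number plus the addresses to publish."""
    generation: int
    addresses: tuple


class EndpointRecord:
    """Hypothetical holder for the plan currently applied to one endpoint."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.applied = DnsPlan(generation=0, addresses=())

    def apply_plan(self, plan: DnsPlan) -> bool:
        """Apply `plan` only if it is newer than the one already in place.

        The comparison and the write happen under the same lock, so a slow
        worker holding an old plan cannot overwrite a newer plan that a
        faster worker applied in the meantime.
        """
        with self._lock:
            if plan.generation <= self.applied.generation:
                return False  # stale plan: reject instead of overwriting
            self.applied = plan
            return True
```

If two workers race, the one carrying the older generation gets `False` back and the newer addresses stay in place, regardless of which worker finishes last.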
AWS acknowledged the impact in its post-mortem and has begun making changes to address the root cause. It disabled the DynamoDB DNS Planner and DNS Enactor automation worldwide while it fixes the race condition and adds safeguards to prevent incorrect DNS plans from being applied. Additional measures include improving the throttling mechanism in EC2's data propagation systems so that it copes better with periods of high load. Commentators such as Yan Cui and Jeremy Daly pointed out that, although the outage drew widespread attention, AWS's reliability has generally remained high, at around 99.84% over the past year.
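As a rough sense of scale, the sketch below converts an availability percentage into hours of downtime and back. It assumes the 99.84% figure is time-based availability over a 365-day year; the article does not say how the number was calculated, and the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours(availability_pct: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime implied by an availability percentage over a period."""
    return (1 - availability_pct / 100) * period_hours


def availability_pct(downtime: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Availability percentage implied by a number of downtime hours."""
    return (1 - downtime / period_hours) * 100


if __name__ == "__main__":
    print(round(downtime_hours(99.84), 2))   # about 14 hours per year
    print(round(availability_pct(15), 2))    # a single 15-hour outage in a year
```

Under that assumption, 99.84% works out to roughly 14 hours of downtime in a year, which is on the same order as the up-to-15-hour impact some customers reported.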
Roman Siewko and Mudassir Mustafa emphasized that the outage, while notable, should not overshadow the overall reliability AWS has maintained. They argued that the tech community may overreact to such rare incidents and make hasty decisions without fully considering AWS's track record. The broader conversation is about balancing the immediate response to an outage against the ongoing, often unnoticed work that ensures long-term uptime and stability.