Links
AWS faced a major outage on October 19-20 due to a race condition in DynamoDB’s DNS management, disrupting multiple services in the Northern Virginia region. Although the underlying fault was corrected relatively quickly, many customers experienced issues for up to 15 hours, prompting discussions on AWS reliability and future improvements.
The author reflects on their experience during the recent Cloudflare outage, highlighting how system limits and complex failures can lead to unexpected problems. They emphasize the importance of understanding the context behind decisions made during incidents and the value of detailed incident writeups for learning.
This article analyzes the November 2025 outage that took down major websites, including Cloudflare, due to a configuration error. It explains how a small change in a configuration file led to a cascading failure across multiple services and provides strategies to prevent similar incidents in the future.
Cloudflare experienced another major outage that lasted 25 minutes, affecting 28% of its HTTP traffic. The outage stemmed from a global configuration change intended to fix a React vulnerability, which led to HTTP 500 errors across its network. This incident follows a similar outage just weeks prior, raising concerns about Cloudflare's reliability.
The article critiques Cloudflare's response to a recent global outage, highlighting flaws in their root cause analysis that overlook fundamental database issues. It argues that the outage stems from a mismatch between application logic and database schema, suggesting that Cloudflare needs to focus on logical design rather than just physical replication to prevent future incidents.
A software update at Snowflake led to a 13-hour outage affecting 10 global regions, preventing customers from querying data or ingesting files. The issue stemmed from a backward-incompatible database schema change, which created version mismatch errors across the platform.
This article examines a recent AWS DynamoDB outage caused by a latent race condition in the DNS management system. It discusses how applying System-Theoretic Process Analysis (STPA) could have identified potential issues before the outage occurred, highlighting the importance of proactive analysis in software reliability.
Cloudflare experienced a widespread outage due to an update to its Web Application Firewall meant to address a vulnerability in React Server Components. The fix caused issues for various enterprise and consumer services, highlighting the risks of relying on single service providers.
On December 5, 2025, Cloudflare experienced a significant outage lasting about 25 minutes due to a configuration change related to their Web Application Firewall. The issue arose from a bug triggered when turning off a testing tool, resulting in HTTP 500 errors for around 28% of customer traffic. Cloudflare is implementing measures to prevent similar incidents in the future.
On November 18, 2025, Cloudflare experienced a significant outage due to a change in database permissions that led to an oversized feature file for their Bot Management system. This caused widespread HTTP 5xx errors across various services until the issue was resolved later that day. The article details the incident, its impact, and steps for future prevention.
Microsoft Azure experienced a significant outage due to an inadvertent configuration change, impacting services like Microsoft 365, Xbox, and various companies including Capital One and Starbucks. Recovery efforts are underway, with many services reporting improvements, though some customers still face issues.
A Cloudflare outage on Tuesday affected major platforms like X and ChatGPT. Initially suspected to be a spike in unusual traffic, the issue actually stemmed from a configuration file that exceeded its size limit, causing a software crash. Cloudflare confirmed there was no malicious activity involved.
BridgePay Network Solutions confirmed a ransomware attack has disrupted its payment gateway, leading to widespread service outages across the U.S. Merchants reported being unable to process card payments, forcing many to accept cash only. The company is working with federal law enforcement and forensic teams, asserting that no payment card data was compromised.
Cloudflare experienced a significant outage due to a bad configuration, impacting many popular apps and services. This incident exposes the risks of centralization in internet infrastructure and emphasizes the need for more redundancy and resilience in our digital systems.
Shopify experienced a significant outage on Cyber Monday, affecting some merchants' operations during peak shopping hours. While the company acknowledged the issue on social media, it assured customers they could still browse and purchase items online. The outage temporarily impacted administration tools like point-of-sale systems.
This article discusses the implications of Downdetector relying on Cloudflare for key services during a November 2025 outage. Despite being a multi-cloud service, Downdetector's use of Cloudflare for DNS and CDN helps manage traffic spikes and maintain performance, even if it introduces a single point of failure. The piece also highlights design considerations and potential improvements for the future.
Cloudflare experienced a global outage, impacting access to many websites and services. The issue stemmed from a configuration file that exceeded its size limit, causing a crash in the system that manages traffic. Although the outage was resolved within a few hours, it highlighted the vulnerability of internet infrastructure.
The author reflects on shutting down their self-hosted git server after years of operation due to overwhelming traffic from AI scrapers. They’ve redirected users to larger git hosting services like GitLab and GitHub and now only maintain a personal blog on a static site.
Cloudflare faced a global outage due to a database permission update that caused 5xx errors across its services. The issue stemmed from a regression that led to duplicate data in the Bot Management system, overwhelming memory limits and crashing the service. Cloudflare has since restored service and is reviewing its systems to prevent similar issues.
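Several of the Cloudflare writeups above describe the same failure shape: an automatically generated feature file grew past a hard size limit (here, because duplicate rows slipped in), and the consuming service crashed rather than rejecting the bad input. A minimal sketch of the defensive pattern, using hypothetical names and limits rather than Cloudflare's actual code:

```python
# Sketch only: validate a regenerated feature file before activating it,
# and fall back to the last known-good version instead of crashing.

MAX_FEATURES = 200  # hypothetical hard limit baked into the consumer


def load_features(candidate: list[str], last_good: list[str]) -> list[str]:
    """Deduplicate and size-check a new feature list before it goes live."""
    deduped = list(dict.fromkeys(candidate))  # drop duplicate rows, keep order
    if len(deduped) > MAX_FEATURES:
        # An oversized file means the generator misbehaved; keep serving
        # the previous known-good file rather than raising a fatal error.
        return last_good
    return deduped
```

The key design choice is fail-safe degradation: a malformed input from an upstream pipeline downgrades one feature instead of taking down the whole request path.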
The article discusses a significant service outage that occurred at Cloudflare on June 12, 2025, affecting numerous websites and services globally. It details the causes of the outage, including technical failures and their impact on users and businesses. Additionally, the company outlines measures taken to prevent similar incidents in the future.
Linktree has mysteriously gone dark in India, leaving users and the company puzzled about the reasons behind this sudden service disruption. Despite attempts to understand the situation, Linktree has not provided a clear explanation for the outage.
A DNS race condition in Amazon's DynamoDB system caused a significant outage that disrupted major websites and services, resulting in potential damages reaching hundreds of billions of dollars. The issue stemmed from a failure in the automated DNS management system, leading to widespread DNS failures and affecting various AWS services. Amazon has since disabled the affected systems and is working to implement safeguards against a recurrence.
An AWS outage caused significant disruptions to various popular services, including Alexa, Fortnite, and Snapchat, leaving many users unable to access these platforms. The incident highlights the reliance on cloud services and the potential impact of downtime on everyday activities and businesses.
Microsoft is investigating a global outage affecting access to the Exchange Admin Center, designated as a critical service issue. Administrators are encountering "HTTP Error 500" when trying to log in, but some have found a workaround via a different URL. Microsoft is working on solutions and has started redirecting traffic to restore access temporarily.
A significant AWS outage on October 19-20, 2025, caused by a DNS failure in the DynamoDB API, led to widespread disruptions across over 140 AWS services, affecting major platforms and clients. The incident highlights the importance of observability in quickly detecting and resolving such failures, emphasizing that organizations using Full-Stack Observability can mitigate financial losses and improve response times during outages. Effective monitoring and real-time visibility into service impacts are crucial for managing risks in cloud environments.
Microsoft is addressing an outage affecting its Azure Front Door CDN, which has disrupted access to various Microsoft 365 services across Europe, Africa, and the Middle East. As of the latest updates, the company has restored approximately 98% of the service and is actively monitoring for full recovery, with only about 4% of previously impacted customers still affected. The incident has been officially mitigated, and users have reported resolution of access issues.
Amazon's cloud service, AWS, experienced a significant outage affecting numerous popular websites and applications, including Snapchat and Reddit. While services have returned to normal, a backlog of messages is still being processed, highlighting the vulnerabilities in the reliance on a few major cloud providers.
Cellcom has confirmed that a week-long service disruption affecting voice and text services in Wisconsin and Upper Michigan was caused by a cyberattack. The company is working with cybersecurity experts to investigate the incident, and while some services are being restored, there is no evidence that customer data was compromised.
A massive outage at Amazon Web Services (AWS) on October 20, 2025, caused widespread disruptions to various internet services globally, affecting numerous businesses and users. The incident highlighted the reliance on cloud services and raised concerns over their stability and resilience. Users experienced significant interruptions, leading to discussions about the implications for digital infrastructure.
Amazon Web Services experienced a significant outage on Monday, affecting numerous major websites including Disney+, Reddit, and United Airlines. Although most services were restored within hours, the outage highlighted the fragility of reliance on major cloud providers, with AWS confirming it was caused by DNS issues related to its DynamoDB service.
AWS experienced a significant outage on October 20, primarily due to DNS issues linked to the departure of senior engineers, leading to concerns about the company's diminishing institutional knowledge. As a result, many internet services were disrupted, highlighting the potential consequences of a talent drain within AWS. The situation raises questions about the company's ability to handle future incidents with a less experienced workforce.
The article discusses an outage affecting services provided by GCP (Google Cloud Platform), Cloudflare, and Anthropic, highlighting the implications for users and businesses reliant on these platforms. It examines the causes of the outage and its impact on cloud computing reliability and security.
A significant incident occurred on July 14, 2025, involving Cloudflare's 1.1.1.1 DNS service, leading to widespread internet disruptions. The article details the nature of the incident, its impact on users, and the steps taken by Cloudflare to resolve the issues.
The article discusses the recent Google Cloud outage, detailing its causes, effects on businesses and users, and the broader implications for cloud reliability. It emphasizes the consequences of such disruptions on critical operations and highlights the need for better contingency planning in cloud services.
On April 16, 2025, Spotify experienced a global outage due to a bug triggered by a change in the order of Envoy Proxy filters, leading to simultaneous crashes of all Envoy instances. The incident caused a significant traffic disruption, except in the Asia Pacific region, and was eventually mitigated by increasing server capacity and addressing configuration issues. Spotify has outlined steps to prevent similar outages in the future, including bug fixes and improvements in their rollout and monitoring processes.
Cloudflare experienced a significant outage on September 12, 2023, affecting both their dashboard and API services. The incident caused disruptions for users relying on these tools, leading to increased scrutiny of the company's infrastructure and response mechanisms during downtime. Cloudflare's team worked to resolve the issues and restore services as quickly as possible.
A single software bug in Amazon's DynamoDB DNS management system caused a significant outage of Amazon Web Services, affecting millions globally for over 15 hours. The failure stemmed from a race condition triggered by the interaction of two components within the system, which led to widespread service disruptions reported by thousands of organizations.
Amazon Web Services resolved a significant outage that affected over 1,000 apps and websites, including Snapchat and major banks, highlighting the risks of relying heavily on a single cloud provider. Experts emphasized the need for companies to build more resilient systems and questioned the sustainability of the current concentration of cloud services among a few major players. The outage, attributed to DNS resolution issues, sparked discussions on the vulnerabilities in the infrastructure of online services.
Amazon's AWS experienced a significant outage due to a major DNS failure linked to a race condition within DynamoDB's infrastructure, affecting users globally for over 14 hours. The incident led to the accidental deletion of all IP addresses for the database service's regional endpoint, causing widespread connectivity issues. In response, Amazon has implemented measures to prevent future occurrences and apologized for the disruption caused to customers.
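The AWS writeups above trace the outage to two automation workers racing to apply DNS plans, with a delayed worker ultimately wiping the endpoint's active records. One common guard against this class of bug is a monotonic generation check on the write path, so a stale plan can never overwrite a newer one. A minimal sketch under assumed names (this is not AWS's actual implementation):

```python
# Sketch only: reject out-of-order "plan" applications with a
# generation check, so a delayed automation worker cannot clobber
# a newer plan with a stale one.
import threading


class DnsEndpoint:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.applied_generation = 0
        self.records: list[str] = []

    def apply_plan(self, generation: int, records: list[str]) -> bool:
        """Apply a plan only if it is newer than what is already live."""
        with self._lock:
            if generation <= self.applied_generation:
                return False  # stale plan: refuse rather than overwrite
            self.applied_generation = generation
            self.records = records
            return True
```

For example, after `apply_plan(2, ["10.0.0.2"])` succeeds, a delayed `apply_plan(1, ["10.0.0.1"])` is rejected and the newer records stay in place.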
The article critiques popular misconceptions surrounding the recent AWS outage, emphasizing that it was not caused by AI and highlighting the pitfalls of adopting a multi-cloud strategy. It discusses the complexities of maintaining cloud systems and the importance of understanding the root causes of outages rather than relying on simplistic explanations or excuses.
An outage at Amazon Web Services left users of Eight Sleep's smart mattresses unable to access temperature controls, resulting in uncomfortable nights. Customers reported waking up distressed as they lost access to the app that regulates their sleep environment. The incident highlighted the vulnerabilities of smart home technology reliant on cloud services.
The article discusses a significant 14-hour outage in the AWS us-east-1 region that affected 140 services, primarily due to a race condition in the DynamoDB DNS management system. The author analyzes the outage's causes and implications, emphasizing the interconnectedness of AWS services and the unexpected nature of such failures in a highly reliable cloud platform.