2 min read | Saved February 14, 2026
Do you care about this?
Netflix engineers presented a centralized platform for managing data deletion across various storage systems while ensuring durability, availability, and correctness. The platform has successfully deleted 76.8 billion rows without data loss, addressing challenges like data resurrection and resource spikes during deletion. Key recommendations emphasize the importance of rigorous validation and centralized monitoring.
If you do, here's more
Netflix engineers Vidhya Arvind and Shawn Liu introduced a centralized data-deletion platform at QCon San Francisco, tackling a complex challenge in distributed system design. The platform efficiently manages deletions across diverse data stores, emphasizing durability, availability, and correctness. It has successfully processed 76.8 billion row deletions across 1,300 datasets without any incidents of data loss. The need for such a system stems from the legal risks associated with data retention under regulations like GDPR and the operational challenges posed by the accumulation of "garbage" data from frequent production tests.
The architecture addresses the difficulties of deleting data from various storage engines, each with unique characteristics. In Cassandra, for instance, deletes write tombstones that must later be purged by background compaction, which incurs CPU costs, while Elasticsearch's eventual segment merging makes deletions resource-intensive. The platform also mitigates data resurrection, where deleted items reappear due to misconfigurations or synchronization problems, a phenomenon the team termed "the ghost in the machine." To ensure effective deletions, Netflix's approach rests on three key principles: durability, availability, and correctness, supported by an architecture that includes control planes, audit jobs, and monitoring systems.
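The audit job described above can be sketched as a re-read check: after the platform believes a batch of rows is gone, it reads each key back and flags anything that has resurrected. This is a minimal illustration only; the function and type names are invented for the example, not Netflix's actual API, and `read_store` stands in for whatever lookup a given storage engine exposes.

```python
from dataclasses import dataclass, field

@dataclass
class AuditReport:
    checked: int = 0
    resurrected: list = field(default_factory=list)

def audit_deletions(deleted_keys, read_store):
    """Re-read every key the platform believes it deleted.

    `read_store` is any callable that returns the stored value for a key,
    or None if the key is absent. A non-None result means the row came
    back ("the ghost in the machine"): a misconfiguration, replica lag,
    or a stale sync re-inserting it.
    """
    report = AuditReport()
    for key in deleted_keys:
        report.checked += 1
        if read_store(key) is not None:
            report.resurrected.append(key)
    return report

# Toy in-memory store where one "deleted" row has resurrected.
store = {"row-2": "stale value"}
report = audit_deletions(["row-1", "row-2", "row-3"], store.get)
```

Running the audit on a schedule, rather than only at deletion time, is what catches the delayed resurrection cases, since a row can reappear long after the original delete succeeded.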
Safeguards during bulk deletions are critical for maintaining system resilience. Netflix employs backpressure mechanisms to slow down operations when resource utilization is high and implements rate limiting that gradually increases request capacity based on system health. Their monitoring tracks essential metrics to ensure the deletion process operates smoothly. The platform’s success is visible in its daily deletion rates, which exceed 3 million rows, and the absence of data loss incidents. Key recommendations from the team highlight the importance of auditing, centralized management, and resilience techniques to build trust and ensure reliable data handling. The inception of this platform was influenced by a previous incident that caused significant data loss, driving the need for a more robust and well-structured approach to data deletion.
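The backpressure and adaptive rate-limiting behavior described above can be sketched as a limiter that ramps capacity up while health samples look good and sheds it quickly under load. The thresholds, step sizes, and class name here are invented for illustration; they are not Netflix's actual tuning or implementation.

```python
class AdaptiveRateLimiter:
    def __init__(self, floor=100, ceiling=10_000, step=1.25):
        self.floor = floor      # minimum deletes/sec, never drop below this
        self.ceiling = ceiling  # hard cap on deletes/sec
        self.step = step        # multiplicative ramp-up factor
        self.rate = floor       # current permitted deletes/sec

    def on_health_sample(self, cpu_util, pending_compactions):
        """Adjust the permitted rate from a periodic health sample.

        High CPU or a compaction backlog signals backpressure: halve the
        rate immediately. Otherwise ramp up gradually toward the ceiling,
        so capacity is earned back slowly after an overload.
        """
        if cpu_util > 0.8 or pending_compactions > 50:
            self.rate = max(self.floor, self.rate // 2)       # back off fast
        else:
            self.rate = min(self.ceiling, int(self.rate * self.step))
        return self.rate

# Usage: three healthy samples ramp the rate up, one overloaded sample halves it.
limiter = AdaptiveRateLimiter()
for _ in range(3):
    limiter.on_health_sample(0.4, 10)   # healthy: 125 -> 156 -> 195
limiter.on_health_sample(0.85, 10)      # overloaded: halve, clamped to floor
```

The asymmetry (multiplicative back-off, gradual ramp-up) is the key design choice: a single unhealthy sample protects the store immediately, while recovery is slow enough not to re-trigger the overload.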