5 min read | Saved February 14, 2026
Do you care about this?
This article discusses a system built for Wayfair that uses PostgreSQL as a Dead Letter Queue (DLQ) to manage failed event processing. Instead of using Kafka for failed events, the system stores them in a PostgreSQL table, allowing for better visibility and easier reprocessing. It also outlines a retry mechanism with exponential backoff to prevent flooding the DLQ with transient failures.
If you do, here's more
While working on a report generation system for Wayfair, the author faced the inevitable challenges of handling failures in a distributed architecture. The system relied on Kafka consumers to enrich events from various sources before persisting them in CloudSQL PostgreSQL. When things went wrong—such as API failures, consumer crashes, or malformed events—the team needed a reliable way to manage these errors without losing critical data.
To address this, they opted for a Dead Letter Queue (DLQ) approach using PostgreSQL instead of Kafka. Storing failed events in a dedicated DLQ table allowed for easier inspection and reprocessing. Each failed event was recorded with a status field indicating where it stood in the retry lifecycle, making retries simpler to manage. The DLQ table was designed for easy querying, with a schema that included fields for event type, payload, error reason, and timestamps. Indexes were added to keep searches and audits efficient.
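The article describes the table's fields but not the exact schema or status values. A minimal sketch of what each DLQ row might carry, with hypothetical field and status names (the article confirms only event type, payload, error reason, status, and timestamps):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class DlqStatus(Enum):
    # Hypothetical lifecycle states; the article only says a status
    # field tracks where the event is in its retry lifecycle.
    PENDING = "PENDING"      # awaiting retry
    RETRYING = "RETRYING"    # picked up by the retry job
    RESOLVED = "RESOLVED"    # reprocessed successfully
    FAILED = "FAILED"        # retries exhausted, needs manual review

@dataclass
class DlqEvent:
    event_type: str
    payload: str             # raw event body, kept verbatim for reprocessing
    error_reason: str
    status: DlqStatus = DlqStatus.PENDING
    retry_count: int = 0
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def record_failure(event_type: str, payload: str, error: Exception) -> DlqEvent:
    """Capture a failed event as a DLQ row instead of dropping it."""
    return DlqEvent(event_type=event_type, payload=payload, error_reason=repr(error))
```

In a real deployment these rows would be written to the PostgreSQL table the article describes; the dataclass just makes the shape of a row concrete.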
The retry mechanism employed ShedLock to ensure that only one instance processed the DLQ at a time, preventing duplicate processing. Events were retried every six hours, in batches of up to 50. To keep transient failures from flooding the DLQ in the first place, consumers applied an exponential backoff strategy, gradually increasing the wait between retries before giving up. This setup improved operational stability, allowing the system to recover from temporary issues while ensuring that only genuinely problematic messages reached the DLQ for manual review.
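The article gives the cadence (a six-hour locked job, batches of 50, exponential backoff) but not the formula or batch query. A sketch under those assumptions, with the base delay and cap chosen for illustration:

```python
def backoff_delay(retry_count: int, base_seconds: int = 60,
                  cap_seconds: int = 6 * 3600) -> int:
    """Exponential backoff: double the wait on each retry, capped so a
    stubborn event never waits longer than the six-hour job interval."""
    return min(base_seconds * (2 ** retry_count), cap_seconds)

def select_batch(pending: list[dict], max_batch: int = 50) -> list[dict]:
    """Take up to 50 of the oldest pending events per run, mirroring the
    batch limit the article describes (a SQL job would use ORDER BY + LIMIT)."""
    return sorted(pending, key=lambda e: e["created_at"])[:max_batch]
```

For example, the first three retries of an event would wait 60, 120, and 240 seconds, and any retry past the tenth is capped at six hours.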