6 min read | Saved February 14, 2026
Do you care about this?
This article details Datadog's approach to creating a managed data replication platform that improves data movement across services. It covers technical challenges faced with a shared Postgres database, the transition to a dedicated search platform, and the use of asynchronous replication to enhance scalability and reliability.
If you do, here's more
Datadog operates thousands of services that require fast and reliable data access, making data movement between systems a complex task. Traditional replication methods like manual pipelines and custom scripts become unmanageable as the number of data sources increases. To tackle this, Datadog developed a managed data replication platform that abstracts the operational overhead for engineering teams and allows for flexible data movement. The platform is designed to provide robust monitoring and alerting while adapting to new use cases without needing extensive rework of the underlying infrastructure.
Initially, the team used a shared Postgres database for their key product pages, but as data volumes grew, they hit scaling limits. For instance, Metrics Summary page queries spanning 82,000 active metrics and 817,000 configurations drove page latencies to nearly 7 seconds. This led to the decision to decouple search functions from the main database and route them to a dedicated search platform. The change drastically improved performance, reducing page load times by up to 97% while keeping replication lag at roughly 500 ms.
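The read-path split described above can be sketched as a small query router: search-style queries go to the dedicated search backend (fed asynchronously from Postgres), while point reads stay on the primary. The class and backend names here are illustrative assumptions, not Datadog's actual implementation:

```python
class StubBackend:
    """Stand-in for a real database or search client; records queries."""
    def __init__(self, name):
        self.name = name
        self.served = []

    def query(self, q):
        self.served.append(q)
        return {"backend": self.name}


class QueryRouter:
    """Hypothetical router splitting search traffic off the primary DB."""
    def __init__(self, postgres_backend, search_backend):
        self.postgres = postgres_backend
        self.search = search_backend

    def execute(self, query):
        # Free-text/faceted lookups go to the search platform, which is
        # kept in sync asynchronously (hence the ~500 ms replication lag).
        if query.get("type") == "search":
            return self.search.query(query)
        # Point reads and writes stay on the primary Postgres database.
        return self.postgres.query(query)


pg = StubBackend("postgres")
search = StubBackend("search")
router = QueryRouter(pg, search)
r1 = router.execute({"type": "search", "q": "metric:cpu"})
r2 = router.execute({"type": "point", "id": 42})
```

The point of the split is that expensive search scans no longer compete with transactional queries for the primary's resources.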
As the platform expanded, the complexity of managing replication pipelines increased. Each pipeline required multiple steps, from setting up logical replication on Postgres to configuring Kafka for data streaming. To streamline this, Datadog implemented automation using Temporal workflows, allowing teams to create and manage pipelines more efficiently. This approach not only improved operational consistency but also freed developers from repetitive tasks, enabling them to focus on innovation.
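The provisioning steps such an automation workflow runs in order can be sketched as plain functions; a real Temporal workflow would additionally retry failed steps and persist progress. The step names, topic naming scheme, and in-memory "infrastructure" dict are illustrative assumptions, not Datadog's actual pipeline code:

```python
# Sketch of the steps a pipeline-automation workflow would execute.
# All names here are hypothetical, for illustration only.

def create_publication(infra, table):
    # Step 1: enable logical replication for the source Postgres table.
    infra.setdefault("publications", []).append(table)

def create_kafka_topic(infra, topic):
    # Step 2: create the Kafka topic the change stream flows through.
    infra.setdefault("topics", []).append(topic)

def start_connector(infra, table, topic):
    # Step 3: start a connector streaming changes from table to topic.
    infra.setdefault("connectors", []).append((table, topic))

def provision_pipeline(infra, table):
    # Run every step in order; a workflow engine like Temporal would
    # also handle retries and record progress so each pipeline is
    # created exactly once.
    topic = f"cdc.{table}"
    create_publication(infra, table)
    create_kafka_topic(infra, topic)
    start_connector(infra, table, topic)
    return topic


infra = {}
topic = provision_pipeline(infra, "metrics_summary")
```

Encoding the steps once and letting a workflow engine drive them is what turns pipeline creation from a multi-step manual procedure into a repeatable operation.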
A key architectural decision involved choosing between synchronous and asynchronous replication. Synchronous replication ensures data consistency but introduces latency and complexity at scale. In contrast, asynchronous replication allows for immediate acknowledgment of writes, making it more suitable for Datadog's high-throughput environment. By opting for asynchronous replication, they prioritized scalability and resilience, allowing them to manage large data volumes while maintaining robust data movement across their services.
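The trade-off above can be modeled in a few lines: with asynchronous replication the primary acknowledges a write immediately and a background step later applies it to the replica, so the replica briefly lags. This is a toy model under assumed names, not any real replication protocol:

```python
from collections import deque

class AsyncReplicatedStore:
    """Toy model of asynchronous replication: writes are acknowledged
    as soon as the primary has them; a queue of pending changes is
    drained to the replica afterwards. Illustrative only."""
    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = deque()

    def write(self, key, value):
        # Acknowledge immediately; the replica catches up later.
        # The gap between these two moments is the replication lag.
        self.primary[key] = value
        self.pending.append((key, value))
        return "ack"

    def replicate_once(self):
        # Apply one queued change to the replica (a real system would
        # stream these continuously, e.g. over Kafka).
        if self.pending:
            key, value = self.pending.popleft()
            self.replica[key] = value


store = AsyncReplicatedStore()
ack = store.write("metric:cpu", {"count": 82_000})
lagging = store.replica.get("metric:cpu")   # not yet replicated
store.replicate_once()
caught_up = store.replica.get("metric:cpu")
```

A synchronous variant would only return "ack" after the replica confirmed the write, which guarantees consistency but couples write latency to the slowest replica; the asynchronous form trades a bounded lag for immediate acknowledgment and higher throughput.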