How Klaviyo Built an Unbreakable System for Running Distributed ML Workloads | by Cayla Schuval | Klaviyo Engineering

7 min read | Saved February 14, 2026

distributed-ml 🤖 ray 🤖 kubernetes 🤖 job-management 🤖 system-architecture 🤖

Do you care about this?

This article details how Klaviyo developed DART Jobs, a system that simplifies running distributed machine learning tasks using the Ray framework. It highlights the architecture, including the DART Jobs API, central database, and sync service, which together ensure reliable job management across multiple Kubernetes clusters.

If you do, here's more

Klaviyo's DART Jobs system builds on Ray, an open-source framework for scaling Python applications, to manage distributed machine learning workloads. Ray simplifies parallelizing tasks: developers can convert local functions into distributed ones with minimal code changes. It offers unified resource management, so the same code can scale from a local environment to a multi-node cloud cluster. Ray’s APIs hide the complexities of distributed computing, such as networking and fault tolerance, so developers can focus on their core tasks.

DART Jobs minimizes the burden on users by handling cluster setup, permissions, and configuration. Initially, it operated within a single Kubernetes cluster, which made job management straightforward. As demand increased, Klaviyo moved to a four-cluster architecture, which introduced challenges with API request management. Without specialized tooling, a follow-up request such as stopping a job could be misrouted, complicating interactions with running jobs.
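
The misrouting problem can be illustrated with a small toy model. This is only a sketch, not Klaviyo's code: the cluster names, the round-robin routing, and the `NaiveRouter` and `TrackedRouter` classes are all hypothetical. It shows why a stop request needs to know where a job was placed once there is more than one cluster.

```python
from itertools import cycle

# Hypothetical cluster names; the article only says there are four clusters.
CLUSTERS = ["cluster-a", "cluster-b", "cluster-c", "cluster-d"]


class NaiveRouter:
    """Sends each request to the next cluster in turn, with no memory of placement."""

    def __init__(self, clusters):
        self._next_cluster = cycle(clusters)
        self.running = {name: set() for name in clusters}

    def submit(self, job_id):
        cluster = next(self._next_cluster)
        self.running[cluster].add(job_id)
        return cluster

    def stop(self, job_id):
        # The stop request goes wherever the router points next,
        # which is usually not the cluster that holds the job.
        cluster = next(self._next_cluster)
        if job_id in self.running[cluster]:
            self.running[cluster].remove(job_id)
            return True
        return False


class TrackedRouter(NaiveRouter):
    """Remembers where each job was placed so follow-up requests reach it."""

    def __init__(self, clusters):
        super().__init__(clusters)
        self.placement = {}

    def submit(self, job_id):
        cluster = super().submit(job_id)
        self.placement[job_id] = cluster
        return cluster

    def stop(self, job_id):
        cluster = self.placement.pop(job_id, None)
        if cluster is None:
            return False
        self.running[cluster].discard(job_id)
        return True
```

With `NaiveRouter`, submitting a job and then stopping it sends the two requests to different clusters, so the stop fails and the job keeps running. `TrackedRouter` succeeds because it records the placement at submit time.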

To tackle these scaling and consistency challenges, Klaviyo implemented a layered architecture comprising the DART Jobs API Server, a central MySQL database, and a DART Jobs Sync Service. The API Server validates job submissions and records them in the database, which serves as the single source of truth. The Sync Service monitors the database for job updates and manages the interaction with Kubernetes and Ray, ensuring that jobs run smoothly across clusters while maintaining accurate state information. This structure not only streamlines the workflow for developers but also enhances system reliability.
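
A rough sketch of this layered flow follows, under several assumptions: an in-memory SQLite database stands in for the central MySQL database, the table schema and state names (`PENDING`, `RUNNING`, `STOP_REQUESTED`, `STOPPED`) are invented for illustration, and the `launch` and `stop` callbacks stand in for whatever the Sync Service actually does against Kubernetes and Ray. The point is only the division of labor: the API side writes to the database, and the sync side reads the database and acts on the clusters.

```python
import sqlite3

# Hypothetical cluster names; the article only says there are four clusters.
CLUSTERS = ["cluster-a", "cluster-b", "cluster-c", "cluster-d"]


def create_db():
    """Create the stand-in job table (SQLite here; the article uses MySQL)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " entrypoint TEXT NOT NULL,"
        " cluster TEXT,"
        " state TEXT NOT NULL)"
    )
    return db


def api_submit(db, entrypoint):
    """API server role: validate the submission, then record it. No cluster calls."""
    if not entrypoint.strip():
        raise ValueError("entrypoint must not be empty")
    cur = db.execute(
        "INSERT INTO jobs (entrypoint, state) VALUES (?, 'PENDING')", (entrypoint,)
    )
    db.commit()
    return cur.lastrowid


def api_stop(db, job_id):
    """API server role: record the desired state; the sync service acts on it."""
    db.execute(
        "UPDATE jobs SET state = 'STOP_REQUESTED'"
        " WHERE id = ? AND state IN ('PENDING', 'RUNNING')",
        (job_id,),
    )
    db.commit()


def sync_once(db, launch, stop):
    """Sync service role: one pass that reconciles clusters with the database."""
    pending = db.execute(
        "SELECT id, entrypoint FROM jobs WHERE state = 'PENDING'"
    ).fetchall()
    for job_id, entrypoint in pending:
        cluster = CLUSTERS[job_id % len(CLUSTERS)]  # placeholder placement policy
        launch(cluster, job_id, entrypoint)
        db.execute(
            "UPDATE jobs SET cluster = ?, state = 'RUNNING' WHERE id = ?",
            (cluster, job_id),
        )

    stopping = db.execute(
        "SELECT id, cluster FROM jobs WHERE state = 'STOP_REQUESTED'"
    ).fetchall()
    for job_id, cluster in stopping:
        if cluster is not None:  # job was launched, so the stop must reach its cluster
            stop(cluster, job_id)
        db.execute("UPDATE jobs SET state = 'STOPPED' WHERE id = ?", (job_id,))
    db.commit()
```

Because the cluster assignment is stored alongside the job, a stop request recorded through `api_stop` is always delivered to the cluster that launched the job, regardless of which API server instance received it.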
