Slonk: Slurm on Kubernetes for ML Research at Character.ai

4 min read | Saved February 14, 2026

Tags: slurm, kubernetes, ml-research, infrastructure, open-source

Do you care about this?

This article explains Slonk, a system developed at Character.ai that combines Slurm and Kubernetes to manage GPU research clusters. It addresses the challenge of giving researchers a reliable scheduling environment while keeping the operational benefits of Kubernetes. The open-source snapshot provides tools and configurations that others can use to build similar systems.

If you do, here's more

Slonk, short for Slurm on Kubernetes, is Character.ai's system for managing GPU research clusters. It puts the familiar Slurm scheduling interface on top of Kubernetes orchestration to serve both researchers and infrastructure teams. Researchers want the reliability and simplicity of Slurm for job scheduling, while operations teams want Kubernetes for its orchestration capabilities, health checks, and autoscaling. With Slonk, researchers keep their usual workflow, submitting jobs with commands like `sbatch`, while the cluster gains Kubernetes' resilience and automation.

Slonk runs Slurm nodes as long-running Kubernetes pods, using StatefulSets for controllers, workers, and login nodes. Each Slurm node corresponds to one Kubernetes pod, so the two systems can be managed together. Every pod is built from a lightweight base image and shares configuration and persistent storage. Health checks detect failing components so they can be handled automatically, which keeps the cluster stable. Slurm's topology-aware scheduler optimizes GPU allocation by co-locating the resources a job uses, which speeds up job execution significantly.
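The one-pod-per-node mapping can be illustrated with a minimal Python sketch. It relies on two known conventions: StatefulSet pods get stable names of the form `<statefulset>-<ordinal>`, and Slurm accepts bracketed ranges in `NodeName` entries. The `WorkerPool` class, the StatefulSet name `slonk-worker`, and the GPU count are hypothetical and chosen only for illustration; the actual Slonk charts may generate their configuration differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerPool:
    """Hypothetical description of one worker StatefulSet (not a Slonk type)."""
    statefulset: str    # StatefulSet name; "slonk-worker" below is made up
    replicas: int       # number of pods, each acting as one Slurm node
    gpus_per_node: int  # GPUs exposed by each pod


def pod_names(pool: WorkerPool) -> list[str]:
    # StatefulSet pods are named <statefulset>-<ordinal>, starting at 0.
    return [f"{pool.statefulset}-{i}" for i in range(pool.replicas)]


def slurm_node_line(pool: WorkerPool) -> str:
    # Build a slurm.conf NodeName entry covering every pod in the pool.
    if pool.replicas < 1:
        raise ValueError("a pool needs at least one replica")
    last = pool.replicas - 1
    if last == 0:
        hostlist = f"{pool.statefulset}-0"
    else:
        # Slurm hostlist syntax: name-[0-N] expands to name-0 ... name-N.
        hostlist = f"{pool.statefulset}-[0-{last}]"
    return f"NodeName={hostlist} Gres=gpu:{pool.gpus_per_node} State=UNKNOWN"


if __name__ == "__main__":
    pool = WorkerPool(statefulset="slonk-worker", replicas=4, gpus_per_node=8)
    print(pod_names(pool))
    print(slurm_node_line(pool))
```

Because pod names are stable across restarts, a node entry written this way keeps pointing at the same logical worker even when the underlying pod is rescheduled.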

One technical challenge is keeping Slurm and Kubernetes state in sync, which requires custom utilities so that both systems agree on resource availability. Health checks detect and manage faulty nodes, while a Kubernetes operator enforces the desired state across the cluster. The approach simplifies GPU management and lets GPUs be reallocated dynamically between training and inference tasks. Character.ai provides an open-source snapshot of Slonk, including Helm charts, health-check scripts, and an operator for lifecycle management, and invites others to adapt and build on it.
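The state-synchronization idea can be sketched as a small reconciliation function in Python. This is an illustration under stated assumptions, not Slonk's actual utilities: the `reconcile` function, its inputs, and the simplified lowercase state names are hypothetical. The emitted `scontrol update NodeName=... State=DRAIN` and `State=RESUME` commands are standard Slurm administration commands, but here they are only returned as strings, not executed.

```python
# States in which Slurm would still schedule work onto a node (simplified).
SCHEDULABLE = {"idle", "allocated", "mixed"}
# States in which Slurm has taken a node out of service (simplified).
OUT_OF_SERVICE = {"drained", "down"}


def reconcile(ready_pods: set[str], slurm_states: dict[str, str]) -> list[str]:
    """Return scontrol commands that would align Slurm with Kubernetes.

    ready_pods:   names of pods Kubernetes reports as ready (assumed input).
    slurm_states: Slurm node name -> simplified state string (assumed input).
    """
    commands = []
    for node in sorted(slurm_states):
        state = slurm_states[node]
        healthy = node in ready_pods
        if not healthy and state in SCHEDULABLE:
            # Slurm thinks the node is usable but its pod is not ready:
            # stop new jobs from landing there.
            commands.append(
                f'scontrol update NodeName={node} State=DRAIN Reason="pod not ready"'
            )
        elif healthy and state in OUT_OF_SERVICE:
            # The pod is healthy again, so return the node to service.
            commands.append(f"scontrol update NodeName={node} State=RESUME")
    return commands


if __name__ == "__main__":
    ready = {"slonk-worker-0", "slonk-worker-2"}
    states = {
        "slonk-worker-0": "idle",
        "slonk-worker-1": "allocated",
        "slonk-worker-2": "drained",
    }
    for cmd in reconcile(ready, states):
        print(cmd)
```

A real implementation would also need to check why a node was drained before resuming it, so that a node an administrator took out of service on purpose is not returned automatically; the sketch omits that for brevity.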
