Keeping 20,000 GPUs healthy

6 min read | Saved February 14, 2026

Tags: gpu, reliability, cloud-computing, healthchecking, performance

Do you care about this?

The article details Modal's approach to maintaining the health of over 20,000 GPUs across various cloud providers. It covers instance selection, machine image preparation, boot checks, and ongoing health monitoring to ensure performance and reliability. The insights aim to guide others in effectively utilizing cloud GPUs.

If you do, here's more

Modal operates a GPU worker pool of over 20,000 concurrent GPUs drawn from major cloud providers such as AWS, GCP, Azure, and OCI. At this scale, they encounter a wide range of reliability issues. The article outlines their GPU reliability system and stresses why it matters, both to Modal's customers and to anyone else renting cloud GPUs. Key topics include cloud instance type selection, machine image preparation, instance boot checks, and ongoing healthchecks.

The performance and reliability of GPU instances differ significantly among cloud providers. For example, Cloud A has a reliable instance launch API, but its H100 GPUs perform 50% worse than competitors' on certain tasks. Cloud C struggled with overheating issues, while Cloud D offers the best price/performance but has problems with hardware slowdowns. Modal uses a semi-automated benchmarking tool to assess these differences and adjusts its internal pricing to account for reliability issues.
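One way to picture how benchmark results and reliability could feed into a single comparison is sketched below. This is an illustration only: the summary does not describe Modal's actual tool or pricing formula, so the `InstanceBenchmark` fields, the `effective_cost` formula, and the idea of discounting throughput by a failure rate are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class InstanceBenchmark:
    """Hypothetical benchmark record for one cloud instance type."""
    name: str             # placeholder label, e.g. "cloud-a-h100"
    hourly_price: float   # dollars per GPU-hour
    throughput: float     # benchmark score, higher is better
    failure_rate: float   # assumed fraction of GPU-hours lost to faults (0.0 to 1.0)


def effective_cost(b: InstanceBenchmark) -> float:
    """Dollars per unit of benchmark work, inflated by unreliability."""
    usable_fraction = 1.0 - b.failure_rate
    if usable_fraction <= 0 or b.throughput <= 0:
        return float("inf")  # an instance that never does useful work
    return b.hourly_price / (b.throughput * usable_fraction)


def rank(benchmarks: list[InstanceBenchmark]) -> list[InstanceBenchmark]:
    """Order instance types from cheapest to most expensive effective cost."""
    return sorted(benchmarks, key=effective_cost)
```

Under this sketch, a cheap but flaky instance type can still rank below a pricier, steadier one, which is the kind of trade-off the paragraph above describes.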

Machine images are crucial for performance and reliability. Modal emphasizes consistency and up-to-date drivers across their multi-cloud setup. They switched to continuous integration with automated testing for machine images, which lets them catch issues early. The article notes that most cloud providers handle machine images reliably, but Cloud D suffers from slow image replication across regions.

During instance boot, Modal performs lightweight checks so that faulty GPUs are not put into service. They have mostly eliminated GPU problems reaching user containers, although some specific GPUs from Cloud C still have occasional issues during initialization.
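A lightweight boot check of the kind described above might look like the following sketch. It assumes that "light" means confirming every expected GPU enumerates through `nvidia-smi`; the summary does not say which checks Modal actually runs, and `boot_check`, `parse_gpu_indices`, and the `expected_gpus` parameter are hypothetical names.

```python
import subprocess
from typing import Callable

# Standard nvidia-smi query flags. Which fields Modal inspects is not stated.
NVIDIA_SMI_CMD = ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"]


def run_nvidia_smi() -> str:
    """Run nvidia-smi and return its stdout; raises if it fails or hangs."""
    result = subprocess.run(
        NVIDIA_SMI_CMD, capture_output=True, text=True, check=True, timeout=30
    )
    return result.stdout


def parse_gpu_indices(csv_text: str) -> list[int]:
    """Extract GPU indices from 'index, name' CSV lines."""
    indices = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        indices.append(int(line.split(",", 1)[0].strip()))
    return indices


def boot_check(expected_gpus: int, query: Callable[[], str] = run_nvidia_smi) -> bool:
    """Return True only if GPUs 0..expected_gpus-1 all enumerate."""
    try:
        indices = parse_gpu_indices(query())
    except (subprocess.SubprocessError, OSError, ValueError):
        return False  # treat a failed or unparsable query as an unhealthy host
    return sorted(indices) == list(range(expected_gpus))
```

The `query` parameter exists so the check can be exercised without a GPU, for example `boot_check(8, query=lambda: captured_output)` against saved command output.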

Once instances are running, Modal employs both passive and active healthchecks to monitor their performance. Passive checks gather data from logs and health reports without affecting GPU performance, while active checks involve direct testing. They prioritize common issues, such as thermal violations and hardware slowdowns, to maintain reliability. The article illustrates how Modal’s detailed monitoring and testing processes help them manage a vast and complex GPU infrastructure effectively.
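The passive/active split can be sketched as a small scheduler. This is a sketch under assumptions, not Modal's implementation: the `HealthCheck` type, the `gpu_idle` flag, and the rule that active probes only run on an idle GPU are all hypothetical, chosen to reflect the point that passive checks do not disturb GPU work while active checks do.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Kind(Enum):
    PASSIVE = "passive"  # reads logs or counters; safe while user work runs
    ACTIVE = "active"    # exercises the GPU; assumed here to need an idle GPU


@dataclass
class HealthCheck:
    name: str
    kind: Kind
    probe: Callable[[], bool]  # returns True when the GPU looks healthy


def run_checks(checks: list[HealthCheck], gpu_idle: bool) -> list[str]:
    """Run every eligible check and return the names of those that failed."""
    failed = []
    for check in checks:
        if check.kind is Kind.ACTIVE and not gpu_idle:
            continue  # skip active probes rather than disturb a running workload
        if not check.probe():
            failed.append(check.name)
    return failed
```

Ordering the `checks` list with the most common failure modes first, such as thermal violations and hardware slowdowns, would keep the highest-priority signals at the front of the returned list.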
