3 min read | Saved February 14, 2026
Do you care about this?
This article explains how Meta is using backend aggregation (BAG) to connect thousands of GPUs across multiple data centers for its Prometheus AI cluster. BAG facilitates high-capacity networking, enabling the infrastructure to meet the demands of large-scale AI applications. It details the technical aspects of BAG's design and implementation, emphasizing performance and reliability.
If you do, here's more
Meta is implementing backend aggregation (BAG) to build its gigawatt-scale AI cluster, Prometheus. BAG connects thousands of GPUs across multiple data centers, providing the infrastructure needed for advanced AI workloads. It integrates two network fabrics: the Disaggregated Scheduled Fabric (DSF) and the Non-Scheduled Fabric (NSF). Once fully operational, Prometheus will deliver 1 gigawatt of capacity, interlinking tens of thousands of GPUs across several data center buildings.
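To get a feel for the scale implied by these numbers, here is a back-of-envelope calculation (not from the article) of how many Ethernet switch ports a regional aggregate of 48 petabits per second would require at common port speeds:

```python
# Back-of-envelope only: port counts are illustrative assumptions,
# not figures from the article.
REGION_CAPACITY_BPS = 48e15  # 48 petabits per second per region

for port_speed_gbps in (400, 800):
    ports = REGION_CAPACITY_BPS / (port_speed_gbps * 1e9)
    print(f"{port_speed_gbps}G ports needed: {ports:,.0f}")
# 400G -> 120,000 ports; 800G -> 60,000 ports
```

Even at 800G per port, tens of thousands of ports are needed per region, which is why the article emphasizes modular, high-radix hardware.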
BAG acts as a centralized Ethernet-based network layer that interconnects the various spine-layer fabrics, supporting immense bandwidth needs with capacities reaching up to 48 petabits per second per region. To manage the interconnection of these GPUs, Meta is deploying regional BAG layers while adhering to specific distance and latency constraints. Connectivity options include planar and spread connection topologies, each offering distinct management and resilience trade-offs.
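The planar/spread distinction can be sketched in a few lines. This is a hypothetical illustration, not Meta's implementation: assume each fabric has several spine planes with a fixed number of uplinks, and the `planar_links` / `spread_links` helpers below are invented names. Planar wiring pins each plane to one BAG switch; spread wiring fans a plane's uplinks across all BAG switches.

```python
# Hypothetical sketch of the two BAG connectivity options.
# Planar: spine plane i connects only to BAG switch i.
# Spread: spine plane i distributes its uplinks across every BAG switch.

def planar_links(num_planes, uplinks_per_plane):
    """Map plane i's uplinks entirely to BAG switch i."""
    return {plane: [(plane, port) for port in range(uplinks_per_plane)]
            for plane in range(num_planes)}

def spread_links(num_planes, uplinks_per_plane):
    """Distribute plane i's uplinks round-robin across all BAG switches."""
    return {plane: [(port % num_planes, port) for port in range(uplinks_per_plane)]
            for plane in range(num_planes)}

planar = planar_links(num_planes=4, uplinks_per_plane=8)
spread = spread_links(num_planes=4, uplinks_per_plane=8)

def bag_switches_used(links):
    return {switch for switch, _port in links}

# Planar: losing one BAG switch removes an entire plane's uplinks.
assert bag_switches_used(planar[0]) == {0}
# Spread: losing one BAG switch costs each plane only a fraction of its uplinks.
assert len(bag_switches_used(spread[0])) == 4
```

The resilience difference falls out directly: planar keeps failure domains clean and easy to reason about, while spread dilutes the impact of any single BAG switch failure across every plane.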
Meta's BAG implementation uses modular hardware, relying on Jericho3 ASIC line cards for high-capacity connections. Routing employs eBGP with Unequal Cost Multipath (UCMP) for effective load balancing. The design prioritizes resilience, with detailed plans for failure management and IP addressing to ensure minimal downtime. By centralizing regional network connections, BAG not only underpins Prometheus but is also set to support Meta's future AI infrastructure demands.
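The core idea behind UCMP is that next hops get traffic in proportion to their capacity rather than equally, while hashing keeps each flow pinned to one path. A minimal sketch, assuming hash-based flow pinning over weighted links (the link names and weights below are invented for illustration, not taken from the article):

```python
import hashlib

def ucmp_pick(flow_id, next_hops):
    """Pick a next hop for a flow, weighted by link capacity.

    next_hops: list of (name, weight) pairs; higher weight -> more flows.
    Hash-based, so a given flow always maps to the same next hop,
    avoiding packet reordering within the flow.
    """
    total = sum(weight for _, weight in next_hops)
    h = int(hashlib.sha256(flow_id.encode()).hexdigest(), 16) % total
    for name, weight in next_hops:
        if h < weight:
            return name
        h -= weight

# Hypothetical mix: two 800G links and one 400G link -> weights 2:2:1.
links = [("bag1", 2), ("bag2", 2), ("bag3", 1)]
counts = {}
for i in range(10_000):
    hop = ucmp_pick(f"flow-{i}", links)
    counts[hop] = counts.get(hop, 0) + 1
# Flows land on the links in roughly a 40/40/20 split.
```

Equal-cost multipath (ECMP) is the special case where all weights are equal; UCMP generalizes it so a slower or partially failed link carries proportionally less traffic instead of an equal share.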