6 min read | Saved February 14, 2026
Do you care about this?
This article discusses the challenges and solutions for deploying large Mixture-of-Experts models on AWS using Elastic Fabric Adapter technology. It details the development of new inter-node kernels that improve performance and reduce latency for these complex models. The authors explain the technical aspects of their implementation and how it enhances cloud-based model deployment.
If you do, here's more
Perplexity has developed custom kernels to enable trillion-parameter models on AWS Elastic Fabric Adapter (EFA). Models of this scale, like Kimi-K2, are too large for a single-node deployment: even a full node of NVIDIA H200 GPUs cannot hold the weights with room left for inference, forcing a move to multi-node configurations. The new kernels achieve low latency for expert parallelism on EFA, outperforming previous solutions like DeepEP. By optimizing communication between nodes, they allow efficient inference for the largest open-source models.
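A back-of-envelope calculation shows why a single node falls short. The figures below (roughly 1T parameters for Kimi-K2, 141 GB of HBM per H200, FP8 weights) are illustrative assumptions, not numbers from the article:

```python
# Rough memory estimate -- assumed figures, not from the article.
# An H200 has 141 GB of HBM; Kimi-K2 is on the order of 1T parameters.
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to store the weights, in GB."""
    return n_params * bytes_per_param / 1e9

single_node_gb = 8 * 141                        # one node: 8x H200
fp8_weights_gb = weight_memory_gb(1e12, 1.0)    # FP8: 1 byte per parameter

# Weights alone nearly fill the node, leaving almost nothing for
# KV cache and activations -- hence multi-node expert parallelism.
print(single_node_gb, fp8_weights_gb)
```

Even under the most favorable quantization, the margin left on one node is too thin for realistic batch sizes, which is what pushes these deployments across nodes.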
Mixture-of-Experts (MoE) architecture is key for scaling models effectively. It replaces each dense feed-forward layer with a set of experts, of which only a few are activated per token, so capacity grows without a proportional increase in compute. MoE routing involves fine-grained peer-to-peer communication that existing libraries struggle to handle. Perplexity's work focuses on specialized kernels that address these challenges, particularly in inter-node deployments over InfiniBand and EFA. The article details how these kernels improve performance by utilizing hybrid CPU-GPU architectures, enabling efficient token dispatch and processing.
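The routing step the paragraph describes is typically top-k gating: each token scores all experts and is dispatched only to the highest-scoring few. A minimal pure-Python sketch of standard top-k gating (not Perplexity's kernel; the function name and k=2 are illustrative):

```python
import math

def topk_route(logits, k=2):
    """Pick the top-k experts for one token and softmax-normalize
    their gate weights. Illustrative sketch of standard MoE gating."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in idx]
    total = sum(exps)
    # Each (expert_id, gate) pair tells the dispatch kernel where to
    # send the token and how to weight the expert's output on combine.
    return [(i, e / total) for i, e in zip(idx, exps)]

routes = topk_route([0.1, 2.0, -1.0, 1.5], k=2)
```

Because each token's chosen experts may live on different GPUs or nodes, the dispatch pattern is irregular all-to-all traffic, which is exactly what the specialized kernels must handle efficiently.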
The design of the dispatch and combine kernels is critical. Each is split into sender and receiver halves to maximize throughput and minimize latency through micro-batching, allowing GPUs to perform other work while data transfers are in flight. The use of unified memory and GDRCopy further cuts communication overhead, enabling faster data handling. By leveraging RDMA for inter-node transfers and NVLink for intra-node communication, the new kernels reduce the overhead of large model deployments and improve overall performance.
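The sender/receiver split with micro-batching can be sketched as a two-stage pipeline: the sender posts small chunks so the receiver starts expert work before the full batch has arrived, overlapping compute with communication. This is a simplified thread-and-queue analogy, not the actual RDMA/GDRCopy implementation; `micro_batch=4` and the doubling stand-in for expert compute are assumptions:

```python
from queue import Queue
from threading import Thread

def sender(tokens, q, micro_batch=4):
    """Sender half: posts transfers in micro-batches so the receiver
    can begin processing before the whole batch is sent."""
    for i in range(0, len(tokens), micro_batch):
        q.put(tokens[i:i + micro_batch])
    q.put(None)  # completion marker

def receiver(q, out):
    """Receiver half: consumes each micro-batch as it lands,
    overlapping 'expert compute' with the remaining transfers."""
    while (mb := q.get()) is not None:
        out.extend(t * 2 for t in mb)  # stand-in for expert FFN work

q, out = Queue(), []
rx = Thread(target=receiver, args=(q, out))
rx.start()
sender(list(range(10)), q)
rx.join()
```

In the real kernels the queue is replaced by RDMA writes into unified memory with GDRCopy-assisted flags, but the scheduling idea, keeping both halves busy on partial data, is the same.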