6 min read | Saved February 14, 2026
Do you care about this?
This article introduces Tensor R-Fork, a method for quickly loading model weights in SGLang instances using GPU-Direct RDMA. It significantly reduces loading times and storage requirements while allowing uninterrupted inference services. The article details the implementation using two backends: NCCL and TransferEngine.
If you do, here's more
The article introduces Tensor R-Fork, a method developed by the Ant Group DeepXPU and SGLang teams to accelerate the loading of large model weights. It addresses the slow cold starts that arise when scaling out large language models (LLMs) such as DeepSeek-R1: traditional weight loading from local disk or remote storage can take several minutes, or even tens of minutes. Tensor R-Fork cuts this loading time to seconds and reduces local storage requirements by about 600GB, while maintaining inference service quality.
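To put the minutes-to-seconds claim in perspective, here is a rough back-of-the-envelope calculation. The parameter count, precision, disk throughput, and NIC counts below are assumptions for illustration, not figures from the article:

```python
# Rough sizing sketch. Assumptions (not from the article):
# DeepSeek-R1 has ~671B parameters stored in FP8 (1 byte each),
# putting the checkpoint on the order of 671 GB -- in the same
# ballpark as the ~600GB local-storage savings the article cites.
params = 671e9
bytes_per_param = 1  # FP8, assumed
checkpoint_gb = params * bytes_per_param / 1e9

# Reading that from local disk at ~2 GB/s takes minutes; pulling it
# peer-to-peer over 8 NICs at 400 Gbps (~50 GB/s each) takes seconds.
disk_read_s = checkpoint_gb / 2          # ~2 GB/s local NVMe, assumed
rdma_read_s = checkpoint_gb / (8 * 50)   # 8 x 400 Gbps NICs, assumed

print(f"checkpoint ~{checkpoint_gb:.0f} GB")
print(f"disk load  ~{disk_read_s / 60:.1f} min")
print(f"RDMA pull  ~{rdma_read_s:.1f} s")
```

Under these assumptions the disk path lands in the several-minute range while the RDMA path is under two seconds, which matches the order-of-magnitude improvement the article describes.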
At the core of Tensor R-Fork is a peer-to-peer weight-storage architecture built on GPU-Direct RDMA, which transfers weight tensors directly between GPUs without staging them through the CPU or host memory, a common bottleneck. The implementation offers two backends: NCCL and TransferEngine. NCCL is straightforward to set up, but it disrupts ongoing inference on the source instance because it must establish a communication group and launch CUDA kernels. TransferEngine, by contrast, enables non-disruptive transfers: it runs alongside each tensor-parallel worker and registers GPU memory addresses so that clients can read the weights directly, without invoking CUDA kernels on the source.
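The register-then-pull pattern behind the non-disruptive backend can be sketched in miniature. The snippet below is a toy simulation in plain Python with no real RDMA or NCCL; names like `register` and `pull_weights` are illustrative, not the TransferEngine API. The point it shows: the seed registers its weight buffers once, and a client then copies them without the seed doing any further work.

```python
# Toy sketch of the peer-to-peer "register then pull" pattern.
# All names are illustrative; a real implementation registers GPU
# memory for RDMA and the client performs direct reads of it.

class SeedRegistry:
    """Stands in for the seed instance's table of registered buffers."""
    def __init__(self):
        self.buffers = {}  # tensor name -> buffer (address-like handle)

    def register(self, name, buf):
        # One-time registration: the seed exposes the buffer's "address".
        # After this, the seed keeps serving -- no kernels are launched
        # on its behalf when a client later reads the buffer.
        self.buffers[name] = buf

def pull_weights(registry, names):
    """Client side: copy each registered buffer directly."""
    return {name: bytes(registry.buffers[name]) for name in names}

# Seed instance: register weights once, then continue serving.
registry = SeedRegistry()
registry.register("layers.0.qkv", bytearray(b"\x01\x02\x03\x04"))
registry.register("layers.0.mlp", bytearray(b"\x05\x06\x07\x08"))

# Client instance: cold-starts by pulling the weights peer-to-peer.
weights = pull_weights(registry, ["layers.0.qkv", "layers.0.mlp"])
print(sorted(weights))  # the client now holds copies of both tensors
```

An NCCL-style transfer would instead require the seed to join a collective operation (a broadcast, say), which is exactly the source-side involvement that the registration approach avoids.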
The article provides detailed usage instructions for both backends, including the commands needed to launch seed and client instances. The NCCL backend is simpler to deploy but disrupts the source instance during transfer. TransferEngine requires an additional library, but moves data efficiently without affecting the source instance's performance. As model sizes and deployment scale continue to grow, fast weight loading of this kind becomes increasingly important.