This article introduces Tensor R-Fork, a method for rapidly loading model weights into new SGLang instances using GPU-Direct RDMA. By copying weights directly from an already-running instance instead of reading them from storage, it significantly reduces both loading time and storage requirements without interrupting the existing inference service. The article details the implementation using two backends: NCCL and TransferEngine.
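To make the core idea concrete, here is a minimal sketch of broadcasting weights from a running instance to a freshly started one over the NCCL backend. This is an illustrative assumption, not the article's actual API: the helper name `rfork_load_weights` is hypothetical, and a real deployment would involve SGLang's own process group setup and GPU-Direct RDMA transport.

```python
import os
import torch
import torch.distributed as dist


def rfork_load_weights(model: torch.nn.Module, src_rank: int = 0) -> None:
    """Hypothetical helper: fill a new instance's parameters by
    broadcasting each tensor from an already-warm source rank,
    so the new instance never touches weight files on disk."""
    for param in model.parameters():
        dist.broadcast(param.data, src=src_rank)


if __name__ == "__main__":
    # Single-process demo over the CPU "gloo" backend; an actual
    # multi-instance setup would use backend="nccl" on GPUs.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=0, world_size=1)
    model = torch.nn.Linear(8, 8)
    rfork_load_weights(model)
    dist.destroy_process_group()
```

With more than one rank, every rank except `src_rank` would end up with an exact copy of the source's tensors, which is the weight-sharing effect the article describes.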