Saved October 29, 2025
This article presents an optimized Triton BF16 grouped GEMM kernel that achieves up to a 2.62x speedup over a manual looped PyTorch implementation for Mixture-of-Experts (MoE) models such as DeepSeek-V3 on NVIDIA H100 GPUs. It details several optimization techniques, including a persistent kernel design, grouped launch ordering for improved cache performance, and efficient use of the Tensor Memory Accelerator (TMA) for loading expert weights. End-to-end benchmarks demonstrate significant improvements in training throughput.
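To make the problem concrete, a grouped GEMM multiplies each expert's own slice of the token matrix by that expert's weight matrix, where the slice sizes vary per expert (which is why a single batched matmul does not apply directly). This is a minimal NumPy sketch of the "manual loop" baseline the kernel is compared against; all names and shapes here are illustrative, not taken from the article:

```python
import numpy as np

def grouped_gemm_reference(tokens, weights, group_sizes):
    """Reference grouped GEMM for MoE: expert e multiplies its own
    contiguous slice of `tokens` (group_sizes[e] rows) by weights[e].
    Group sizes differ per expert, so this cannot be a plain bmm."""
    outputs = []
    start = 0
    for e, n in enumerate(group_sizes):
        outputs.append(tokens[start:start + n] @ weights[e])
        start += n
    return np.concatenate(outputs, axis=0)

# Illustrative shapes: 3 experts, hidden dim 8, FFN dim 16,
# with an uneven token count routed to each expert.
rng = np.random.default_rng(0)
group_sizes = [4, 1, 3]
tokens = rng.standard_normal((sum(group_sizes), 8))
weights = rng.standard_normal((3, 8, 16))
out = grouped_gemm_reference(tokens, weights, group_sizes)
print(out.shape)  # (8, 16)
```

The optimized Triton kernel fuses this per-expert loop into one launch; the loop above only pins down the semantics it must reproduce.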
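The "grouped launch ordering" mentioned above is, in spirit, the tile-swizzling trick from Triton's matmul tutorial: the linear program id is remapped so that consecutive programs work on a small group of output-tile rows, keeping a compact working set of A-rows and B-columns hot in cache. A pure-Python sketch of that index remapping (the parameter names are illustrative, not the article's):

```python
def grouped_order(pid, num_pid_m, num_pid_n, group_size_m):
    """Remap a linear program id to an (pid_m, pid_n) output tile so
    programs are issued column-major within groups of `group_size_m`
    tile rows, improving cache reuse versus plain row-major order."""
    num_pid_in_group = group_size_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_size_m
    # The last group may be shorter than group_size_m.
    group_size = min(num_pid_m - first_pid_m, group_size_m)
    pid_m = first_pid_m + (pid % num_pid_in_group) % group_size
    pid_n = (pid % num_pid_in_group) // group_size
    return pid_m, pid_n

# A 4x6 tile grid with groups of 2 rows: programs 0..3 cover tiles
# (0,0), (1,0), (0,1), (1,1) before moving across the row group.
pairs = [grouped_order(p, 4, 6, 2) for p in range(24)]
```

Every tile is still visited exactly once; only the visit order changes, which is what makes this a pure cache-locality optimization.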