An optimized Triton BF16 Grouped GEMM kernel is presented, achieving up to a 2.62x speedup over a manual PyTorch implementation for Mixture-of-Experts (MoE) models such as DeepSeekv3 on NVIDIA H100 GPUs. The article details several optimization techniques, including a persistent kernel design, grouped launch ordering for improved cache reuse, and efficient use of the Tensor Memory Accelerator (TMA) for loading expert weights. End-to-end benchmarking results demonstrate significant improvements in training throughput.
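To make the persistent-kernel idea concrete, the sketch below shows a minimal Triton grouped GEMM in the same spirit: a fixed grid of programs (one per SM) walks tiles of every expert's GEMM in expert-major order, so consecutive tiles reuse the same expert weights. This is an illustrative simplification, not the article's kernel: the TMA path and autotuning are omitted, and the names (`persistent_grouped_gemm_kernel`, `group_offsets`, the block sizes) are assumptions chosen for the example. It assumes activations are pre-sorted by expert, with `group_offsets` holding per-expert row prefix sums.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def persistent_grouped_gemm_kernel(
    a_ptr, b_ptr, c_ptr,          # activations (M_total, K), expert weights (E, K, N), output (M_total, N)
    group_offsets_ptr,            # (E + 1,) int32 prefix sums: rows of expert e are [off[e], off[e+1])
    num_experts, N, K,
    stride_am, stride_ak,
    stride_be, stride_bk, stride_bn,
    stride_cm, stride_cn,
    NUM_SMS: tl.constexpr,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid = tl.program_id(0)
    tiles_n = tl.cdiv(N, BLOCK_N)

    # Persistent design: the grid size equals the SM count, and each program
    # strides through the global tile list (ordered expert by expert) instead
    # of the launcher creating one program per output tile.
    tile_id = pid                       # this program's next tile, relative to the current expert
    for e in range(num_experts):
        m_start = tl.load(group_offsets_ptr + e)
        m_end = tl.load(group_offsets_ptr + e + 1)
        tiles_m = tl.cdiv(m_end - m_start, BLOCK_M)
        num_tiles = tiles_m * tiles_n   # tiles belonging to expert e

        while tile_id < num_tiles:
            pid_m = tile_id // tiles_n
            pid_n = tile_id % tiles_n
            offs_m = m_start + pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
            offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)

            acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
            for k in range(0, K, BLOCK_K):
                offs_k = k + tl.arange(0, BLOCK_K)
                a = tl.load(
                    a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak,
                    mask=(offs_m[:, None] < m_end) & (offs_k[None, :] < K), other=0.0,
                )
                b = tl.load(
                    b_ptr + e * stride_be + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn,
                    mask=(offs_k[:, None] < K) & (offs_n[None, :] < N), other=0.0,
                )
                acc = tl.dot(a, b, acc)

            tl.store(
                c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn,
                acc.to(tl.bfloat16),
                mask=(offs_m[:, None] < m_end) & (offs_n[None, :] < N),
            )
            tile_id += NUM_SMS          # jump to this program's next tile
        tile_id -= num_tiles            # re-base the tile index for the next expert


def grouped_gemm(a, b, group_offsets):
    """a: (M_total, K) bf16 rows sorted by expert; b: (E, K, N) bf16; group_offsets: (E + 1,) int32."""
    num_experts, K, N = b.shape
    c = torch.empty((a.shape[0], N), device=a.device, dtype=torch.bfloat16)
    num_sms = torch.cuda.get_device_properties(a.device).multi_processor_count
    persistent_grouped_gemm_kernel[(num_sms,)](
        a, b, c, group_offsets,
        num_experts, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1), b.stride(2),
        c.stride(0), c.stride(1),
        NUM_SMS=num_sms, BLOCK_M=64, BLOCK_N=64, BLOCK_K=64,
    )
    return c
```

Visiting tiles expert-major means a given expert's weight tiles stay hot in L2 while its token block is processed, which is the cache-locality effect the article's grouped launch ordering targets; the production kernel additionally loads those weights through TMA descriptors and tunes block sizes per shape.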