An optimized Triton BF16 Grouped GEMM kernel is presented, achieving up to a 2.62x speedup over a manual PyTorch implementation for Mixture-of-Experts (MoE) models such as DeepSeekv3 on NVIDIA H100 GPUs. The article details several optimization techniques, including a persistent kernel design, grouped launch ordering for improved cache reuse, and efficient use of the Tensor Memory Accelerator (TMA) for loading expert weights. End-to-end benchmarking results demonstrate significant improvements in training throughput.
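To make the persistent-kernel idea concrete, the sketch below shows a minimal Triton grouped GEMM in the same spirit: a fixed grid of programs (one per SM) walks tiles of every expert's GEMM in expert-major order, so consecutive tiles reuse the same expert weights. This is an illustrative simplification, not the article's kernel: the TMA path and autotuning are omitted, and the names (`persistent_grouped_gemm_kernel`, `group_offsets`, the block sizes) are assumptions chosen for the example. It assumes activations are pre-sorted by expert, with `group_offsets` holding per-expert row prefix sums.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def persistent_grouped_gemm_kernel(
    a_ptr, b_ptr, c_ptr,          # activations (M_total, K), expert weights (E, K, N), output (M_total, N)
    group_offsets_ptr,            # (E + 1,) int32 prefix sums: rows of expert e are [off[e], off[e+1])
    num_experts, N, K,
    stride_am, stride_ak,
    stride_be, stride_bk, stride_bn,
    stride_cm, stride_cn,
    NUM_SMS: tl.constexpr,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid = tl.program_id(0)
    tiles_n = tl.cdiv(N, BLOCK_N)

    # Persistent design: the grid size equals the SM count, and each program
    # strides through the global tile list (ordered expert by expert) instead
    # of the launcher creating one program per output tile.
    tile_id = pid                       # this program's next tile, relative to the current expert
    for e in range(num_experts):
        m_start = tl.load(group_offsets_ptr + e)
        m_end = tl.load(group_offsets_ptr + e + 1)
        tiles_m = tl.cdiv(m_end - m_start, BLOCK_M)
        num_tiles = tiles_m * tiles_n   # tiles belonging to expert e

        while tile_id < num_tiles:
            pid_m = tile_id // tiles_n
            pid_n = tile_id % tiles_n
            offs_m = m_start + pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
            offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)

            acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
            for k in range(0, K, BLOCK_K):
                offs_k = k + tl.arange(0, BLOCK_K)
                a = tl.load(
                    a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak,
                    mask=(offs_m[:, None] < m_end) & (offs_k[None, :] < K), other=0.0,
                )
                b = tl.load(
                    b_ptr + e * stride_be + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn,
                    mask=(offs_k[:, None] < K) & (offs_n[None, :] < N), other=0.0,
                )
                acc = tl.dot(a, b, acc)

            tl.store(
                c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn,
                acc.to(tl.bfloat16),
                mask=(offs_m[:, None] < m_end) & (offs_n[None, :] < N),
            )
            tile_id += NUM_SMS          # jump to this program's next tile
        tile_id -= num_tiles            # re-base the tile index for the next expert


def grouped_gemm(a, b, group_offsets):
    """a: (M_total, K) bf16 rows sorted by expert; b: (E, K, N) bf16; group_offsets: (E + 1,) int32."""
    num_experts, K, N = b.shape
    c = torch.empty((a.shape[0], N), device=a.device, dtype=torch.bfloat16)
    num_sms = torch.cuda.get_device_properties(a.device).multi_processor_count
    persistent_grouped_gemm_kernel[(num_sms,)](
        a, b, c, group_offsets,
        num_experts, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1), b.stride(2),
        c.stride(0), c.stride(1),
        NUM_SMS=num_sms, BLOCK_M=64, BLOCK_N=64, BLOCK_K=64,
    )
    return c
```

Visiting tiles expert-major means a given expert's weight tiles stay hot in L2 while its token block is processed, which is the cache-locality effect the article's grouped launch ordering targets; the production kernel additionally loads those weights through TMA descriptors and tunes block sizes per shape.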