Novita AI presents a series of optimizations for GLM4-MoE models that improve performance in production environments. Key results include up to a 65% reduction in Time-to-First-Token (TTFT) and a 22% improvement in Time-Per-Output-Token (TPOT), achieved through techniques such as Shared Experts Fusion and Suffix Decoding. These methods streamline the inference pipeline and exploit repetition in output data to speed up code generation.
Novita AI's recent optimizations for deploying GLM4-MoE models on SGLang achieve significant performance improvements: Time-to-First-Token (TTFT) drops by up to 65% and Time-Per-Output-Token (TPOT) improves by 22% under agentic coding workloads. The gains come from a suite of strategies that streamline the inference pipeline, from kernel execution to data transfer. All tests were validated on H200 clusters with 8-way tensor parallelism (TP8) and FP8 precision, demonstrating that the optimizations hold up in practical deployments.
Key optimizations include Shared Experts Fusion, which merges the shared-expert and routed-expert computation paths that every input token traverses, improving Streaming Multiprocessor (SM) utilization and reducing memory overhead. QKNorm Fusion combines the head-wise query/key normalization computations into a single kernel tailored to GLM4-MoE. Async Transfer optimizes data movement by scheduling host-device transfers immediately after the GPU operations that produce them, cutting TTFT significantly, particularly for models that issue many kernel launches.
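The idea behind Shared Experts Fusion can be sketched numerically: instead of running the shared expert as a separate pass added to the routed-expert output, the shared expert is appended to every token's routed set with a fixed weight of 1.0, so a single expert loop (one grouped GEMM on the GPU) covers both paths. The NumPy sketch below is illustrative only; the function names (`moe_separate`, `moe_fused`, `route_topk`) and the tiny dense experts are assumptions for demonstration, not SGLang's actual FP8 kernels.

```python
import numpy as np

def route_topk(logits, k):
    # Pick top-k routed experts per token and softmax their gate weights.
    idx = np.argsort(logits, axis=-1)[:, -k:]
    sel = np.take_along_axis(logits, idx, axis=-1)
    sel = sel - sel.max(axis=-1, keepdims=True)          # stable softmax
    w = np.exp(sel) / np.exp(sel).sum(axis=-1, keepdims=True)
    return idx, w

def moe_separate(x, experts, shared, k, logits):
    # Baseline: routed experts, then the shared expert as a second pass.
    idx, w = route_topk(logits, k)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            out[t] += w[t, j] * (x[t] @ experts[idx[t, j]])
    return out + x @ shared            # extra kernel launch for shared path

def moe_fused(x, experts, shared, k, logits):
    # Fused: treat the shared expert as expert index E with fixed weight 1.0,
    # appended to each token's routed set -> one loop / one grouped GEMM.
    E = len(experts)
    all_experts = experts + [shared]
    idx, w = route_topk(logits, k)
    idx = np.concatenate([idx, np.full((x.shape[0], 1), E)], axis=1)
    w = np.concatenate([w, np.ones((x.shape[0], 1))], axis=1)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k + 1):
            out[t] += w[t, j] * (x[t] @ all_experts[idx[t, j]])
    return out
```

Because the shared expert becomes just one more entry in the grouped computation, the separate kernel launch for the shared path disappears, which is where the SM-utilization and memory-overhead gains come from.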
Suffix Decoding offers a model-free approach to speculative decoding that draws draft tokens from patterns in previous outputs. The method is effective because agentic coding tasks repeat output patterns at a high rate, so drafts built from earlier generations are frequently accepted by the target model. In testing, Suffix Decoding reduced TPOT from 25.13 ms to 19.63 ms, demonstrating the value of reusing historical context when generating new tokens. Novita AI has released the evaluation dataset for further research, reflecting a commitment to transparency and collaboration in advancing AI performance.
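A minimal, model-free draft proposer in the spirit of Suffix Decoding can be sketched as follows. The full algorithm maintains a suffix tree over prompts and prior generations; this simplified version (the class name `SuffixIndex` and its parameters are illustrative assumptions) indexes fixed-length n-grams from earlier outputs and, at each decode step, backs off from the longest matching suffix to propose the continuation observed most often. The target model then verifies the whole draft in one forward pass, as in standard speculative decoding.

```python
from collections import defaultdict

class SuffixIndex:
    """Model-free draft proposer: index n-grams from earlier outputs and
    speculate the continuation that most often followed the current suffix."""

    def __init__(self, max_ctx=4):
        self.max_ctx = max_ctx
        # context n-gram -> {next_token: count}
        self.next_counts = defaultdict(lambda: defaultdict(int))

    def add(self, tokens):
        # Record every (context, next-token) pair for contexts up to max_ctx.
        for n in range(1, self.max_ctx + 1):
            for i in range(len(tokens) - n):
                ctx = tuple(tokens[i:i + n])
                self.next_counts[ctx][tokens[i + n]] += 1

    def propose(self, tokens, max_draft=4):
        # Greedily extend the current sequence with the most frequent
        # continuation, backing off from the longest matching suffix.
        draft, cur = [], list(tokens)
        for _ in range(max_draft):
            best = None
            for n in range(self.max_ctx, 0, -1):
                ctx = tuple(cur[-n:])
                if ctx in self.next_counts:
                    cand = self.next_counts[ctx]
                    best = max(cand, key=cand.get)
                    break
            if best is None:
                break
            draft.append(best)
            cur.append(best)
        return draft
```

Because drafting costs only dictionary lookups rather than a draft model's forward pass, every accepted draft token is nearly free, which is how a TPOT reduction of the reported magnitude can arise on repetitive agentic-coding outputs.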