Click any tag below to further narrow down your results
Links
Novita AI presents a series of optimizations for the GLM4-MoE models that enhance performance in production environments. Key improvements include a 65% reduction in Time-to-First-Token and a 22% increase in throughput, achieved through techniques like Shared Experts Fusion and Suffix Decoding. These methods streamline the inference pipeline and leverage data patterns for faster code generation.