4 min read | Saved February 14, 2026
Do you care about this?
This article explores the significance of INT4 quantization in large language models (LLMs). It discusses how K2-Thinking's approach optimizes inference speed and stability while minimizing precision loss, making low-bit quantization a standard in model training.
If you do, here's more
The article details the significance of the native INT4 quantization format introduced with K2-Thinking, emphasizing its role in optimizing large language models (LLMs). Liu Shaowei, an infrastructure engineer at Kimi-Moonshot, explains that quantization is evolving beyond a simple trade-off between precision and speed: with advancements in parameter scaling and test-time scaling, low-bit quantization is becoming standard in large model training. The Kimi-K2 architecture, which uses a mixture-of-experts (MoE) structure, benefits from smaller weight footprints because decoding is largely memory-bandwidth-bound; switching from FP8 to W4A16 sharply decreases decoding latency without sacrificing quality.
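To make the W4A16 idea concrete, here is a minimal sketch of group-wise symmetric INT4 weight quantization in NumPy. The group size of 32 matches the 1×32 scale granularity mentioned later; the function names and the symmetric rounding scheme are illustrative assumptions, not K2-Thinking's actual kernels.

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=32):
    # One FP32 scale per group of 32 weights (1x32 granularity).
    # Symmetric quantization: map max |w| in the group to the INT4 level 7.
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    # W4A16-style use: weights stay 4-bit in memory and are expanded
    # back to floating point just before the matmul.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_int4_groupwise(w)
w_hat = dequantize_int4(q, s).reshape(-1)
# Per-element error is bounded by half a quantization step in each group.
print(np.abs(w - w_hat).max())
```

Storing 4-bit integers plus one scale per 32 values cuts weight bytes roughly 4× versus FP16, which is where the decoding speedup comes from when the matmul is bandwidth-bound.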
The article contrasts post-training quantization (PTQ) with quantization-aware training (QAT). While PTQ suits shorter tasks, it struggles with longer reasoning chains, where quantization error accumulates and calibration data becomes unrepresentative. K2-Thinking therefore employs QAT to maintain precision and stability in long-context reasoning. The QAT implementation was quick, integrating training and inference seamlessly to achieve near-lossless results. The INT4 format also benefits reinforcement learning (RL) training, making it faster and more stable: lower inference latency is crucial for efficient rollouts.
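The article does not give implementation details, but QAT schemes commonly insert "fake quantization" into the forward pass and use a straight-through estimator (STE) for gradients. A minimal sketch under that assumption (the function names and the symmetric INT4 range are illustrative):

```python
import numpy as np

def fake_quant(w, scale):
    # Forward pass: quantize then dequantize, so training sees the
    # same INT4 rounding error that inference will incur.
    return np.clip(np.round(w / scale), -8, 7) * scale

def fake_quant_backward(grad_out, w, scale):
    # Straight-through estimator: the rounding step is treated as
    # identity for gradients; only values that clipped outside the
    # INT4 range [-8, 7] get zero gradient.
    r = w / scale
    inside = (r >= -8) & (r <= 7)
    return grad_out * inside

w = np.array([0.1, -0.3, 2.0])
scale = 0.1
fq = fake_quant(w, scale)                      # 2.0 clips at level 7 -> 0.7
g = fake_quant_backward(np.ones(3), w, scale)  # gradient zeroed at the clip
print(fq, g)
```

Because the quantization error is already present during training, the exported INT4 weights behave the same at inference time, which is what makes the result near-lossless rather than a post-hoc approximation.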
Kimi's choice of INT4 over other formats like MXFP4 is based on compatibility with a wider range of GPUs and more mature kernel support. At a scale granularity of 1×32 (one shared scale per 32 values), INT4 matches FP4 in expressiveness while being more adaptable to different hardware setups. The article hints at future developments, suggesting that W4A16 is just the starting point, with W4A8 and W4A4 on the horizon as new chip technologies emerge. The narrative positions quantization not merely as a technical adjustment but as a foundational element in advancing LLM capabilities.
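The expressiveness comparison can be sketched by enumerating the per-group code points of each format, assuming FP4 means the E2M1 element type used by MXFP4 in the OCP Microscaling spec (the article itself does not spell out the encoding):

```python
# Representable levels within one group, before the shared 1x32 scale applies.

# INT4: 16 uniformly spaced integer levels.
int4_levels = list(range(-8, 8))

# FP4 E2M1: 1 sign bit, 2 exponent bits, 1 mantissa bit.
# Magnitudes are non-uniform, denser near zero.
e2m1_magnitudes = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
fp4_levels = sorted({sign * m for sign in (-1.0, 1.0) for m in e2m1_magnitudes})

print(len(int4_levels), int4_levels)
print(len(fp4_levels), fp4_levels)  # +0 and -0 collapse to one level
```

Both formats offer roughly 16 code points per value; FP4 spends them non-uniformly while INT4's uniform grid, combined with a fine 1×32 scale, covers each group comparably well and maps onto integer arithmetic that existing GPU kernels already support.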