7 min read | Saved February 14, 2026
Do you care about this?
This article details the enhancements in Differential Transformer V2 (DIFF V2) over its predecessor. It focuses on the architecture's gains in decoding efficiency and training stability, achieved by doubling the number of query heads while keeping key-value heads fixed and by removing per-head normalization. Experimental results show reduced loss and fewer gradient spikes in large language model training.
If you do, here's more
Differential Transformer V2 (DIFF V2) advances its predecessor, DIFF V1, with a focus on efficient large language model (LLM) decoding. By doubling the number of query heads while keeping the key-value (KV) heads constant, DIFF V2 speeds up decoding without increasing KV-cache memory. The design runs on standard attention kernels at the same speed as a conventional Transformer, whereas DIFF V1 required custom attention kernels and had to load the value cache multiple times, which slowed its decoding.
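The decoding path described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `diff_v2_attention`, the shapes, and the scalar `lam` are assumptions. The point it demonstrates is that the two (doubled) query groups reuse one shared KV cache through ordinary matrix products, so no custom kernel is needed.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_v2_attention(q1, q2, k, v, lam):
    """One DIFF V2 head (sketch): two query groups share one KV pair.

    q1, q2: (t, d) the doubled query heads for this KV head
    k, v:   (t, d) shared key/value cache, loaded once as in a
            standard Transformer -- this is the efficiency point
    lam:    learned scalar weighting the second attention map
    """
    d = k.shape[-1]
    a1 = softmax(q1 @ k.T / np.sqrt(d))
    a2 = softmax(q2 @ k.T / np.sqrt(d))
    # Differential attention: subtract the second map to cancel
    # common-mode attention noise, then read the shared values once.
    return (a1 - lam * a2) @ v
```

With `lam = 0` the second map drops out and the computation reduces exactly to standard attention, which is why the memory footprint and kernel requirements match a conventional Transformer.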
Key improvements in DIFF V2 include eliminating the per-head RMSNorm, which caused instability and massive gradients during DIFF V1 training. Removing this layer gives DIFF V2 a gradient norm comparable to that of a standard Transformer. The model also introduces a mechanism to manage the context root mean square (RMS), addressing the attention sinks that can destabilize training. In preliminary experiments, DIFF V2 reduces language modeling loss by 0.02 to 0.03 at one trillion training tokens, with fewer activation outliers and gradient spikes, particularly under large learning rates.
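The normalization change can be illustrated with a minimal sketch. The article only says that DIFF V1 applied an RMSNorm per head and that DIFF V2 removes it; the helper `rmsnorm` and the placeholder `diff_out` below are hypothetical names for illustration.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # Root-mean-square normalization over the head dimension.
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

# Stand-in for one head's differential attention output.
diff_out = np.array([[0.5, -2.0, 3.0, 0.1]])

# DIFF V1 (per the article): each head's output passed through its
# own RMSNorm, which proved a source of massive gradients.
out_v1 = rmsnorm(diff_out)

# DIFF V2: the per-head norm is removed entirely; the context RMS is
# instead managed by a separate mechanism not detailed in the article.
out_v2 = diff_out
```

Because `rmsnorm` forces every head's output to unit RMS regardless of its actual scale, its backward pass couples all components of the head output, which is one plausible reading of why removing it tames the gradients.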
DIFF V2 is also more parameter-efficient. By constructing the differential operation explicitly, it saves approximately 25% of the attention module's parameters, which can be reallocated to other components of the model. The authors argue that this efficiency also supports training stability and control over outliers, and that even if DIFF V2 only matches the loss of baseline models, its gains in training efficiency and stability make it a valuable alternative for future LLMs.
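One plausible accounting of the ~25% figure, under an assumption the article does not state: if the doubled query heads come at half head-dimension (so the query and output projections stay full size) while the kept-constant KV heads make the key and value projections half width, the arithmetic comes out to exactly a quarter of the attention parameters. The helper `attn_params` is hypothetical.

```python
def attn_params(d, kv_frac=1.0):
    """Parameter count of one attention block with model width d.

    kv_frac scales the key/value projection width relative to d.
    """
    w_q = d * d                  # query projection (full width)
    w_k = int(d * d * kv_frac)   # key projection
    w_v = int(d * d * kv_frac)   # value projection
    w_o = d * d                  # output projection (full width)
    return w_q + w_k + w_v + w_o

d = 4096
standard = attn_params(d)               # 4 * d^2, as in a vanilla block
diff_v2 = attn_params(d, kv_frac=0.5)   # 3 * d^2 under the assumption above
saving = 1 - diff_v2 / standard         # -> 0.25
```

This is only one way to reach 25%; the article does not break down which projections shrink, so treat the split as illustrative.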