6 min read | Saved February 14, 2026
Do you care about this?
DFlash introduces a lightweight block diffusion model that enhances speculative decoding by enabling faster and more accurate parallel drafting. It combines the speed of diffusion models with the verification strength of autoregressive models, achieving significant performance improvements over existing methods like EAGLE-3. The approach demonstrates how to leverage the benefits of both model types without sacrificing quality.
If you do, here's more
DFlash introduces a lightweight block diffusion model aimed at improving speculative decoding in autoregressive large language models (LLMs). The method tackles the inefficiency of traditional approaches like EAGLE-3, which draft tokens serially and therefore cap the achievable speedup. DFlash delivers lossless acceleration for models like Qwen3-8B, outperforming EAGLE-3. The key innovation lies in using diffusion for drafting, which generates draft tokens in parallel while a robust autoregressive model still performs verification, so output quality is preserved.
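The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration of generic speculative decoding, not DFlash's actual code: both "models" are stand-in functions over integer tokens, and `draft_block` fakes the parallel proposal that a block diffusion drafter would produce in one shot.

```python
# Illustrative sketch of speculative decoding (NOT DFlash's implementation).
# A cheap drafter proposes a block of tokens at once; the expensive target
# model verifies them, keeping the longest matching prefix.

def draft_block(context, block_size):
    # Hypothetical parallel drafter: proposes `block_size` tokens in one shot.
    return [(context[-1] + i + 1) % 10 for i in range(block_size)]

def target_next_token(context):
    # Hypothetical target model: the expensive autoregressive "ground truth".
    return (context[-1] + 1) % 10

def speculative_step(context, block_size=4):
    """Draft a block, verify against the target, and accept the matching
    prefix plus one corrected/bonus token (the standard accept scheme)."""
    draft = draft_block(context, block_size)
    accepted = []
    for tok in draft:
        expected = target_next_token(context + accepted)
        if tok == expected:
            accepted.append(tok)       # draft token verified; keep it
        else:
            accepted.append(expected)  # replace with target's token, stop
            break
    else:
        # Every draft token was accepted; target contributes one bonus token.
        accepted.append(target_next_token(context + accepted))
    return accepted

out = speculative_step([3])
print(out)  # all four draft tokens verified, plus a bonus: [4, 5, 6, 7, 8]
```

The speedup comes from the fact that one target-model pass can verify a whole block, so the more draft tokens survive verification, the fewer expensive passes are needed per generated token.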
The design of DFlash leverages the hidden features of the target model to strengthen the draft model. By conditioning the drafter on these features, DFlash pairs the speed of a lightweight diffusion model with the reasoning capability of the much larger target model. This also keeps memory requirements low: earlier diffusion-based approaches often needed massive parameter counts, which hindered practical deployment.
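The conditioning idea can be illustrated with a minimal sketch. All shapes, weight names, and the fusion rule (project the target's hidden state into the draft width, then add) are assumptions for illustration, not DFlash's actual architecture:

```python
# Minimal sketch of conditioning a small drafter on the target model's
# hidden features (shapes and fusion rule are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
d_target, d_draft, vocab = 64, 16, 100

# Hypothetical learned projection from target hidden width to draft width,
# and the drafter's output head.
W_cond = rng.normal(size=(d_target, d_draft)) * 0.1
W_out = rng.normal(size=(d_draft, vocab)) * 0.1

def draft_logits(target_hidden, draft_state):
    """Fuse the target model's hidden feature into the draft state
    (project + add), then predict next-token logits."""
    fused = draft_state + target_hidden @ W_cond
    return fused @ W_out

target_hidden = rng.normal(size=(d_target,))  # from the big model's layers
draft_state = rng.normal(size=(d_draft,))     # the drafter's own state
logits = draft_logits(target_hidden, draft_state)
print(logits.shape)  # (100,)
```

The point of the design: the expensive hidden feature is a byproduct of the target's verification pass, so the drafter gets high-quality context essentially for free.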
Benchmark results showcase significant speedups across various tasks. For instance, DFlash achieves an average speedup of 5.79x on math tasks compared to EAGLE-3's 2.19x. In coding benchmarks, DFlash shows an average speedup of 4.63x, highlighting its efficiency and effectiveness across different applications. The architecture relies on a 5-layer block diffusion model, balancing draft quality with processing speed. This strategy addresses the historical trade-off between speed and accuracy in LLMs, suggesting a more efficient path forward for speculative decoding.
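The reported math-task averages imply DFlash's relative advantage over EAGLE-3 directly:

```python
# Relative advantage on math tasks, from the averages reported above.
dflash_math, eagle3_math = 5.79, 2.19
ratio = round(dflash_math / eagle3_math, 2)
print(ratio)  # 2.64, i.e. roughly 2.6x faster than EAGLE-3 on math tasks
```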