5 min read | Saved February 14, 2026
Do you care about this?
This article outlines Distribution-Aligned Sequence Distillation, a pipeline that improves performance on reasoning tasks such as math and code generation using minimal training data. It introduces models such as DASD-4B-Thinking and DASD-30B-A3B-Thinking-Preview, which outperform larger models across various benchmarks. The methodology includes temperature-scheduled learning and mixed-policy distillation for better performance.
If you do, here's more
Distribution-Aligned Sequence Distillation (DASD) is a pipeline that enhances reasoning tasks like mathematical problem-solving and code generation using minimal training data. It combines temperature-scheduled learning, divergence-aware sampling, and mixed-policy distillation. Together, these techniques let DASD achieve top-tier performance, even outperforming 32B-scale models on benchmarks such as AIME24, AIME25, LiveCodeBench, and GPQA-Diamond.
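The article does not give the formula behind temperature-scheduled learning, but the general idea of annealing a sampling temperature over training can be sketched as follows. The linear decay, the function name, and the start/end values here are illustrative assumptions, not DASD's actual schedule.

```python
def scheduled_temperature(step, total_steps, t_start=1.0, t_end=0.6):
    """Linearly anneal sampling temperature from t_start to t_end.

    Early in training a higher temperature keeps teacher samples diverse;
    later, a lower temperature concentrates probability mass on the
    teacher's preferred reasoning paths. (Illustrative sketch only.)
    """
    frac = min(step / max(total_steps, 1), 1.0)  # clamp progress to [0, 1]
    return t_start + (t_end - t_start) * frac
```

For example, halfway through a 100-step run this yields a temperature of 0.8, midway between the assumed start (1.0) and end (0.6) values.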
Two key variants of DASD are highlighted. The DASD-4B-Thinking model is lightweight yet excels across various benchmarks, achieving notable results with multi-stage training. The DASD-30B-A3B-Thinking-Preview model uses a Mixture-of-Experts architecture, which allows for increased model capacity while maintaining efficiency through sparse expert routing. Despite being trained only on initial data, it shows impressive efficiency and quality trade-offs, particularly in AIME25 and LiveCodeBench benchmarks.
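Sparse expert routing, which gives the 30B-A3B variant its capacity/efficiency trade-off, generally works by activating only the top-k experts per token. The gating scheme below is a generic top-k softmax sketch, not DASD's actual router; the function name and default k are assumptions.

```python
import math

def top_k_route(gate_logits, k=2):
    """Select the k highest-scoring experts and renormalize their weights.

    gate_logits: one router score per expert for a single token.
    Returns (expert_index, weight) pairs; only these k experts run,
    so compute scales with k rather than the total expert count.
    """
    # Indices of the k largest logits (the experts this token activates).
    topk = sorted(range(len(gate_logits)),
                  key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over just the selected experts so their weights sum to 1.
    exps = [math.exp(gate_logits[i]) for i in topk]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(topk, exps)]
```

With router scores `[0.0, 2.0, 1.0, -1.0]`, only experts 1 and 2 are activated, and their renormalized weights sum to 1, which is how a mixture-of-experts layer keeps per-token compute close to a much smaller dense model.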
The article announces the release of several resources, including the DASD-4B-Thinking model and related datasets, now available on platforms like Hugging Face and ModelScope. The methodology takes a systematic approach to sequence-level distillation, optimizing for reasoning diversity rather than correctness alone. A coherent pipeline combines the mechanisms above to improve both learning stability and the fidelity of reasoning transfer within a compact model. The article also provides code snippets for deploying the model, making these reasoning capabilities accessible to developers.