2 min read | Saved February 14, 2026
Do you care about this?
This report presents the Qwen3-ASR family: two speech recognition models supporting 52 languages and dialects. The 1.7B model offers state-of-the-art performance among open-source options, while the 0.6B model balances accuracy and efficiency for rapid transcription; a companion model, Qwen3-ForcedAligner-0.6B, handles fast, accurate forced alignment of text-speech pairs. All models are released under the Apache 2.0 license for community use.
If you do, here's more
Qwen3-ASR is a new family of speech recognition models introduced in this technical report, comprising two main models: Qwen3-ASR-1.7B and Qwen3-ASR-0.6B. Both support language identification and automatic speech recognition (ASR) for 52 languages and dialects, building on extensive speech training data and the advanced audio capabilities of their foundation model, Qwen3-Omni. The 1.7-billion-parameter version achieves state-of-the-art performance among open-source ASR models and competes closely with leading proprietary APIs. The 0.6-billion-parameter model strikes a balance between accuracy and efficiency, reaching a low average time to first token (TTFT) of 92 milliseconds and transcribing 2000 seconds of speech per second of wall-clock time at a concurrency of 128.
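The throughput claim above can be sanity-checked with standard real-time-factor (RTF) arithmetic. The sketch below uses only the numbers quoted in this summary; the formulas are generic benchmarking math, not part of any Qwen3-ASR API.

```python
# Back-of-the-envelope check of the quoted throughput figures.
audio_seconds = 2000       # speech transcribed per second of wall-clock time
wall_clock_seconds = 1
concurrency = 128          # parallel streams

# Aggregate speed-up over real time, summed across all streams.
aggregate_rtfx = audio_seconds / wall_clock_seconds   # 2000x real time

# Average speed-up that each individual stream sees.
per_stream_rtfx = aggregate_rtfx / concurrency        # 15.625x real time

ttft_ms = 92               # reported average time to first token
print(f"aggregate: {aggregate_rtfx:.0f}x real time, "
      f"per stream: {per_stream_rtfx:.3f}x, TTFT: {ttft_ms} ms")
```

So even though the 2000x figure is an aggregate across 128 concurrent requests, each individual stream is still transcribed at roughly 15.6 times real time.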
Another noteworthy component is Qwen3-ForcedAligner-0.6B, a non-autoregressive model for timestamp prediction. It aligns text-speech pairs across 11 languages with better accuracy and efficiency than the three strongest existing forced-alignment models. Experiments show significant performance gains in real-world applications, underscoring the practical advantages of the Qwen3-ASR family over existing solutions.
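To make "timestamp prediction" concrete, the toy sketch below shows the kind of output a forced aligner produces: per-word start/end times for a known transcript. The 20 ms frame rate and the frame-level alignment are made up for illustration; the actual interface of Qwen3-ForcedAligner-0.6B is not described in this summary.

```python
# Toy illustration of forced-alignment output: converting a frame-level
# alignment (word -> span of acoustic frames) into second-level timestamps.
FRAME_SEC = 0.02  # assumed 20 ms per frame (hypothetical value)

# (word, first_frame, last_frame) -- a hypothetical alignment result
alignment = [("hello", 5, 24), ("world", 30, 52)]

def to_timestamps(alignment, frame_sec=FRAME_SEC):
    """Convert inclusive frame spans into (word, start_s, end_s) tuples."""
    return [(word, round(f0 * frame_sec, 2), round((f1 + 1) * frame_sec, 2))
            for word, f0, f1 in alignment]

print(to_timestamps(alignment))
# [('hello', 0.1, 0.5), ('world', 0.6, 1.06)]
```

A non-autoregressive aligner predicts all of these spans in parallel rather than token by token, which is why it can be both faster and more accurate than autoregressive decoding for this task.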
To support further research and development in ASR and audio understanding, the authors have made these models available under the Apache 2.0 license. This open-access approach aims to foster collaboration and innovation within the community, allowing researchers and developers to build upon their findings and improve speech recognition technologies.