Qwen3-Omni is a cutting-edge multilingual omni-modal foundation model capable of processing text, images, audio, and video, providing real-time streaming responses. It features significant architectural advancements for performance, supports 119 text languages, and offers various applications through detailed cookbooks, including speech recognition, audio captioning, and video analysis. The model is available for use via Hugging Face and ModelScope, with recommendations for optimal performance.
qwen3-omni ✓
multimodal ✓
+ multilingual
audio-captioning ✓
machine-learning ✓