

Qwen3-Omni-30B
#37 in Multimodal Models · qwen · v3 · omni 30b · since 2025-09-22 · 2× · last 29 June 2026
Qwen3-Omni-30B is a natively end-to-end omni-modal language model from Alibaba's Qwen team, built on a Mixture-of-Experts architecture with 30 billion total parameters, of which 3 billion are active per token. It processes text, image, audio, and video inputs and generates both text and real-time speech output. The model is released as open weights under the Apache 2.0 license and is also available via Alibaba's DashScope API. According to the official technical report, Qwen3-Omni achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 of them, outperforming closed-source systems such as Gemini-2.5-Pro.
Features
| Feature | Details |
| --- | --- |
| Context Window (Tokens) | 32,768 tokens native (Instruct variant); Thinking variant up to 65,536 tokens |
| Multimodal Inputs and Outputs | Text, image, audio, video (input); text and natural speech (output, real-time streaming) |
| On-Device vs. Cloud | Both: open-weight under Apache 2.0 (Hugging Face / ModelScope, self-hosting via vLLM or Transformers); cloud API via Alibaba DashScope |
| Price per Unit | $0.25 / 1M input tokens; $0.97 / 1M output tokens (Alibaba Cloud API, Instruct variant) |
| Video Analysis Capability | Supports video analysis (evaluated at fps=2); known weakness on long-video benchmarks, attributed to limited context length and position extrapolation (improving this is noted as a future goal in the technical report) |
| Vision-Language Benchmark Score | MMStar: 68.5 (Instruct variant); on par with Qwen2.5-VL-72B; outperforms GPT-4o and Gemini-2.0-Flash on MMMU-Pro, MathVista, and MATH-Vision |
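
The per-token prices in the table can be turned into a rough per-request cost estimate. The following is a minimal sketch, not an official calculator: the function name `estimate_cost_usd` and the example token counts are hypothetical, and it assumes every input token is billed at the single listed input rate. Audio, image, or video inputs may be metered differently on the actual API, so treat the result as an approximation.

```python
from decimal import Decimal

# Prices taken from the table above (Alibaba Cloud API, Instruct variant),
# expressed in USD per 1 million tokens.
INPUT_USD_PER_MILLION = Decimal("0.25")
OUTPUT_USD_PER_MILLION = Decimal("0.97")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the cost of one request in USD (hypothetical helper).

    Assumes a flat rate for all input tokens, which is a simplification.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = Decimal(input_tokens) * INPUT_USD_PER_MILLION / TOKENS_PER_MILLION
    output_cost = Decimal(output_tokens) * OUTPUT_USD_PER_MILLION / TOKENS_PER_MILLION
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical request: 20,000 input tokens and 1,000 output tokens.
    print(estimate_cost_usd(20_000, 1_000))
```

`Decimal` is used instead of `float` so that small per-request amounts sum exactly when aggregated over many calls.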