Synthszr Charts: the big AI brands competing for the podium

Qwen3-Omni-30B

#37 in Multimodal Models

qwen · v3 · omni 30b · since 2025-09-22 · 2× · most recently June 29, 2026

Momentum

Qwen3-Omni-30B is a natively end-to-end omni-modal language model from Alibaba's Qwen team, built on a Mixture-of-Experts architecture with 30 billion total parameters, of which 3 billion are active. It accepts text, image, audio, and video input and generates both text and real-time speech. The model is released as open weights under the Apache 2.0 license and is also available via Alibaba's DashScope API. According to the official technical report, Qwen3-Omni achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 of them, outperforming closed-source systems such as Gemini-2.5-Pro.

[Chart: Momentum trend]

Features

Context Window (Tokens)	32,768 tokens native (Instruct variant); Thinking variant up to 65,536 tokens
Multimodal Inputs	Input: text, image, audio, video; output: text and natural speech (real-time streaming)
On-Device vs. Cloud	Both: open-weight under Apache 2.0 (Hugging Face / ModelScope, self-hosting via vLLM or Transformers); cloud API via Alibaba DashScope
Price per Unit	$0.25 / 1M input tokens; $0.97 / 1M output tokens (Alibaba Cloud API, Instruct variant)
Video Analysis Capability	Supports video analysis (evaluated at fps=2); known weakness on long-video benchmarks due to limited context length and position extrapolation (the technical report names improving this as a future goal)
Vision-Language Benchmark Score	MMStar: 68.5 (Instruct variant); on par with Qwen2.5-VL-72B; outperforms GPT-4o and Gemini-2.0-Flash on MMMU-Pro, MathVista, and MATH-Vision
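
The price, context-window, and fps figures in the table lend themselves to a quick back-of-the-envelope check. The following is a minimal sketch, not an official calculator: the constants are copied from the rows above, while `RequestEstimate` and `sampled_frames` are hypothetical helper names, and treating input and output tokens as sharing one context window is an assumption, not something the table states.

```python
from dataclasses import dataclass

# Figures copied from the feature table (Instruct variant, Alibaba Cloud API).
INPUT_USD_PER_MILLION = 0.25
OUTPUT_USD_PER_MILLION = 0.97
INSTRUCT_CONTEXT_TOKENS = 32_768
VIDEO_SAMPLING_FPS = 2  # evaluation setting quoted in the table


@dataclass
class RequestEstimate:
    """Hypothetical helper: rough cost and context check for one request."""
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        # Price is quoted per one million tokens, separately for input and output.
        total = (self.input_tokens * INPUT_USD_PER_MILLION
                 + self.output_tokens * OUTPUT_USD_PER_MILLION)
        return total / 1_000_000

    @property
    def fits_instruct_context(self) -> bool:
        # Assumption: input and output tokens count against the same window.
        return self.input_tokens + self.output_tokens <= INSTRUCT_CONTEXT_TOKENS


def sampled_frames(duration_seconds: float) -> int:
    """Number of frames sampled from a clip at the table's fps=2 setting."""
    return int(duration_seconds * VIDEO_SAMPLING_FPS)


if __name__ == "__main__":
    estimate = RequestEstimate(input_tokens=20_000, output_tokens=2_000)
    print(f"cost: ${estimate.cost_usd:.5f}")
    print(f"fits Instruct context: {estimate.fits_instruct_context}")
    print(f"frames in a 60 s clip: {sampled_frames(60)}")
```

How many tokens each sampled frame consumes is not given in the table, so the sketch stops at the frame count rather than guessing a per-frame token cost.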
