The "Thinking" edition of Qwen3-VL's second-largest MoE model offers fast responses, enhanced multimodal understanding and reasoning, visual agent capabilities, and ultra-long context support for inputs such as long videos and documents. It improves image and video comprehension, spatial perception, and object recognition for complex real-world tasks.
Visual reasoning on a budget
Multimodal agent task with thinking
Efficient image analysis with reasoning
131,072 tokens
32,768 tokens
$0.20 per 1M tokens
$2.40 per 1M tokens
$15 per 1K calls
$0.19 per 1K calls
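The figures above appear without their labels, so the following is only a rough sketch of how they might be used to budget a request. It assumes that 131,072 is the total context window, 32,768 is the maximum output length, $0.20 per 1M tokens is the input price, and $2.40 per 1M tokens is the output price. None of these roles is confirmed by the listing, and the two per-1K-call prices are left out because it is unclear what they apply to. The function names are hypothetical.

```python
# Figures copied from the listing; the role assigned to each one is an ASSUMPTION.
CONTEXT_WINDOW_TOKENS = 131_072   # assumed: total context window (input + output)
MAX_OUTPUT_TOKENS = 32_768        # assumed: maximum output length
INPUT_USD_PER_MILLION = 0.20      # assumed: price per 1M input tokens
OUTPUT_USD_PER_MILLION = 2.40     # assumed: price per 1M output tokens


def fits_in_context(input_tokens: int, output_tokens: int) -> bool:
    """Return True if the request respects both assumed token limits."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    within_output_cap = output_tokens <= MAX_OUTPUT_TOKENS
    within_window = input_tokens + output_tokens <= CONTEXT_WINDOW_TOKENS
    return within_output_cap and within_window


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate token cost in USD under the assumed input/output prices."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


if __name__ == "__main__":
    prompt, completion = 100_000, 8_000
    print(fits_in_context(prompt, completion))
    print(f"${estimate_cost_usd(prompt, completion):.4f}")
```

Under these assumptions, a request with 100,000 input tokens and 8,000 output tokens would fit the limits and cost roughly $0.04 in token charges. Because thinking output is priced at the higher rate in this reading, long reasoning traces would dominate the bill well before the input does.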