A visual reasoning model built on a Mixture-of-Experts (MoE) architecture with 106B total parameters and 12B active parameters. It achieves state-of-the-art performance among open-source VLMs of its scale across image, video, and document understanding and GUI tasks. A thinking mode toggle lets users balance speed against reasoning depth. It excels at generating webpage code from screenshots, object detection, document parsing, and long video analysis.
Visual reasoning with open-source VLM
Image and video understanding with thinking
Document analysis and GUI tasks
64,000 tokens
16,000 tokens
$0.60 per 1M tokens
$1.80 per 1M tokens
$0.11 per 1M tokens
$15 per 1K calls
$0.19 per 1K calls
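The per-token figures above can be turned into a rough per-request estimate. The sketch below is illustrative only: the listing gives the rates without labels, so treating $0.60 per 1M tokens as the input rate and $1.80 per 1M tokens as the output rate is an assumption, and `estimate_cost_usd` is a hypothetical helper, not part of any provider SDK.

```python
from decimal import Decimal

# Rates copied from the listing. Which rate applies to which token type is
# an ASSUMPTION: the source shows the figures without labels.
ASSUMED_INPUT_USD_PER_1M = Decimal("0.60")
ASSUMED_OUTPUT_USD_PER_1M = Decimal("1.80")


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the token cost of one request under the assumed rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    per_million = Decimal(1_000_000)
    input_cost = Decimal(input_tokens) * ASSUMED_INPUT_USD_PER_1M / per_million
    output_cost = Decimal(output_tokens) * ASSUMED_OUTPUT_USD_PER_1M / per_million
    return input_cost + output_cost


if __name__ == "__main__":
    # Example: a 48,000-token prompt with a 16,000-token response.
    print(estimate_cost_usd(48_000, 16_000))
```

`Decimal` is used instead of floats so that small per-request amounts add up without rounding drift when summed over many calls. The per-call prices in the listing are not modeled here because the source does not say what they apply to.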