A visual reasoning model built on a Mixture-of-Experts (MoE) architecture with 106B total parameters and 12B active parameters. Achieves state-of-the-art performance among open-source VLMs of its scale across image, video, document understanding, and GUI tasks. Includes a thinking-mode toggle for trading off speed against reasoning depth. Excels at webpage code generation from screenshots, object detection, document parsing, and long video analysis.
64,000 tokens
16,000 tokens
$0.60
$1.80
$0.11
$15
$0.19
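
To put the 106B total / 12B active split from the description in perspective, here is a minimal sketch of the arithmetic. The constant and function names are illustrative only and not part of any published API. In a sparse MoE model, per-token compute tracks the active parameters, while the weights of all experts generally still have to be stored.

```python
# Parameter counts taken from the model description, in billions.
TOTAL_PARAMS_B = 106
ACTIVE_PARAMS_B = 12


def active_fraction(active_b: float, total_b: float) -> float:
    """Return the share of parameters that are active (hypothetical helper)."""
    if total_b <= 0 or not 0 <= active_b <= total_b:
        raise ValueError("expected 0 <= active_b <= total_b and total_b > 0")
    return active_b / total_b


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"Active share of parameters: {fraction:.1%}")  # prints 11.3%
```

Roughly one parameter in nine is active, which is the usual reason MoE models of this size are cheaper to run than dense models with the same total parameter count.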