A 32B multimodal model that excels on key benchmarks for language, text, and image capabilities. Serves 23 languages with full image understanding: pass in images and text together and get a single coherent response. Focused on state-of-the-art multilingual performance.
16,000 tokens
4,000 tokens
$0.50
$1.50
Vision research model; superseded by Command A Vision