A state-of-the-art 32B multimodal model that excels on a range of critical language and image benchmarks. It serves 23 languages with full image understanding: pass in images and text together and get a single coherent response. Its focus is multilingual performance.
Multilingual image understanding
Cross-lingual visual question answering
Multimodal multilingual document analysis
16,000 tokens
4,000 tokens
$0.50 per 1M tokens
$1.50 per 1M tokens
$15 per 1K calls
$0.19 per 1K calls
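
As a rough illustration of how per-unit rates like those above translate into a charge, here is a minimal Python sketch. The helper name `cost_usd` is hypothetical. Pairing the 16,000-token figure with the $0.50 rate and the 4,000-token figure with the $1.50 rate is an assumption made only for the arithmetic, since the listing does not state which rate applies to which kind of token or call.

```python
from decimal import Decimal


def cost_usd(units: int, rate_usd: str, per: int) -> Decimal:
    """Return the charge in USD for `units` billed at `rate_usd` per `per` units.

    Decimal is used instead of float so small amounts stay exact.
    """
    return Decimal(units) * Decimal(rate_usd) / Decimal(per)


# Assumption: 16,000 tokens billed at $0.50 per 1M tokens.
first = cost_usd(16_000, "0.50", 1_000_000)

# Assumption: 4,000 tokens billed at $1.50 per 1M tokens.
second = cost_usd(4_000, "1.50", 1_000_000)

# Per-call rates use the same arithmetic with a 1K divisor.
calls = cost_usd(1_000, "0.19", 1_000)

print(first, second, first + second, calls)
```

Under these assumptions, a single request that uses both token figures in full costs about 1.4 cents in token charges.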