A compact multimodal model using mixture-of-experts architecture for text and image understanding. Designed for efficient multimodal experiences with vision support.
Try Now131,072 tokens
8,192 tokens
$0.08 per 1M tokens
$0.3 per 1M tokens
8 files
Yes
Poor tool calling capabilities and hallucinates web searches
$0 per month