A compact multimodal model that uses a mixture-of-experts architecture for text and image understanding. Designed for efficient multimodal experiences with vision support.
Efficient multimodal tasks
Vision-enabled workflows
Compact Llama 4
131,072 tokens
8,192 tokens
$0.08
$0.30
$15
$0.19
Poor tool-calling capabilities; hallucinates web searches