A compact multimodal model that uses a mixture-of-experts architecture for text and image understanding. Designed for efficient multimodal experiences with vision support.
Efficient multimodal tasks
Vision-enabled workflows
Compact Llama 4
131,072 tokens
8,192 tokens
$0.08 per 1M tokens
$0.30 per 1M tokens
$15 per 1K calls
$0.19 per 1K calls
Poor tool-calling capabilities; hallucinates web searches
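
The two per-token rates listed above ($0.08 and $0.30 per 1M tokens) can be turned into a rough per-request cost estimate. The sketch below is illustrative only: the card does not label the rates, so treating $0.08 as the input (prompt) rate and $0.30 as the output (completion) rate is an assumption, and the function and constant names are hypothetical. The per-call charges are left out because the card does not say what they apply to.

```python
# Rates copied from the card, in USD per 1M tokens.
# ASSUMPTION: the card does not label them; input/output assignment is a guess.
ASSUMED_INPUT_RATE_PER_M = 0.08   # assumed: prompt tokens
ASSUMED_OUTPUT_RATE_PER_M = 0.30  # assumed: completion tokens


def estimate_token_cost(input_tokens: int, output_tokens: int) -> float:
    """Return an estimated token cost in USD for one request."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        input_tokens * ASSUMED_INPUT_RATE_PER_M
        + output_tokens * ASSUMED_OUTPUT_RATE_PER_M
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Example: a 100,000-token prompt with a 2,000-token reply.
    cost = estimate_token_cost(100_000, 2_000)
    print(f"Estimated token cost: ${cost:.4f}")
```

If the rates turn out to be labeled the other way around, swapping the two constants is the only change needed.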