A multimodal model from the Llama 4 collection that uses a mixture-of-experts (MoE) architecture for text and image tasks. It is designed for multimodal experiences that need vision capabilities.
Multimodal text and image tasks
Vision-enabled workflows
Llama 4 multimodal
131,072 tokens
8,192 tokens
$0.15
$0.60
$15
$0.19
Poor tool-calling capabilities, and it hallucinates web searches