A multimodal model from the Llama 4 collection with a mixture-of-experts (MoE) architecture, designed for text and image tasks that need vision capabilities.
Multimodal text and image tasks
Vision-enabled workflows
Llama 4 multimodal
131,072 tokens
8,192 tokens
$0.15 per 1M tokens
$0.60 per 1M tokens
$15 per 1K calls
$0.19 per 1K calls
Poor tool-calling capabilities, and it hallucinates web searches.