MiMo V2 Omni

Xiaomi

MiMo V2 Omni is a frontier omni-modal model that natively processes image, video, and audio inputs within a unified architecture. It combines strong multimodal perception with agentic capabilities (visual grounding, multi-step planning, tool use, and code execution), making it well suited to complex real-world tasks that span modalities. It has a 256K-token context window.

Capabilities

Thinking

Tool Use

Image Input

Example Use Cases

Multimodal tasks spanning image, video, and audio

Visual grounding and agentic planning

Omni-modal perception with tool use

Technical Specifications

Context Window: 262,144 tokens

Max Output: 65,536 tokens
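As a rough illustration of how these two limits interact, the sketch below computes how many prompt tokens remain once some output budget is reserved. It assumes that input and output tokens share the same context window, which the listing does not state; the function name `max_prompt_tokens` is hypothetical.

```python
# Limits copied from the specifications above.
CONTEXT_WINDOW = 262_144  # tokens (256 * 1024)
MAX_OUTPUT = 65_536       # tokens (64 * 1024)


def max_prompt_tokens(reserved_output: int = MAX_OUTPUT) -> int:
    """Return the prompt budget left after reserving output tokens.

    Assumption (not confirmed by the listing): prompt and output
    tokens are counted against the same context window.
    """
    if not 0 <= reserved_output <= MAX_OUTPUT:
        raise ValueError("reserved_output must be between 0 and MAX_OUTPUT")
    return CONTEXT_WINDOW - reserved_output


if __name__ == "__main__":
    print(max_prompt_tokens())      # budget when reserving the full max output
    print(max_prompt_tokens(4096))  # budget when reserving a short reply
```

Under that assumption, reserving the full 65,536 output tokens leaves 196,608 tokens for the prompt.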

Pricing

Token Costs (per 1M tokens)

Cache Miss Input: $0.40

Non-Reasoning Output: $2.00

Cache Read Input: $0.08

Tool Costs (per 1K calls)

Web Search: $15.00

Code Execution: $0.19
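The rates above can be combined into a per-request estimate. The following is a minimal sketch, assuming costs scale linearly with usage and with no rounding or minimum charges; the function name `estimate_cost` and its parameters are hypothetical. It covers only non-reasoning output, since no reasoning-output rate is listed.

```python
# Rates in USD, copied from the pricing listed above.
CACHE_MISS_INPUT_PER_M = 0.40      # per 1M uncached input tokens
CACHE_READ_INPUT_PER_M = 0.08      # per 1M cached input tokens
NON_REASONING_OUTPUT_PER_M = 2.00  # per 1M non-reasoning output tokens
WEB_SEARCH_PER_K = 15.00           # per 1K web search calls
CODE_EXECUTION_PER_K = 0.19        # per 1K code execution calls


def estimate_cost(
    uncached_input_tokens: int,
    cached_input_tokens: int,
    output_tokens: int,
    web_searches: int = 0,
    code_executions: int = 0,
) -> float:
    """Estimate the USD cost of one request (assumes linear billing)."""
    token_cost = (
        uncached_input_tokens * CACHE_MISS_INPUT_PER_M
        + cached_input_tokens * CACHE_READ_INPUT_PER_M
        + output_tokens * NON_REASONING_OUTPUT_PER_M
    ) / 1_000_000
    tool_cost = (
        web_searches * WEB_SEARCH_PER_K
        + code_executions * CODE_EXECUTION_PER_K
    ) / 1_000
    return token_cost + tool_cost


if __name__ == "__main__":
    cost = estimate_cost(100_000, 50_000, 10_000, web_searches=2, code_executions=1)
    print(f"${cost:.5f}")
```

For example, 100,000 uncached input tokens, 50,000 cached input tokens, 10,000 output tokens, two web searches, and one code execution come to about $0.094 under these assumptions, with the two web searches ($0.03) accounting for roughly a third of the total.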

Legacy

Made legacy on

Reason: Untested

Recommended Replacement: Qwen3.5 Plus