InternVL3 78B

opengvlab

The InternVL3 series is an advanced multimodal large language model (MLLM). Compared to InternVL 2.5, InternVL3 demonstrates stronger multimodal perception and reasoning capabilities. In addition, InternVL3 is benchmarked against the Qwen2.5 Chat models, whose pre-trained base models serve as the initialization for its language component. Benefiting from Native Multimodal Pre-Training, the InternVL3 series surpasses the Qwen2.5 series in overall text performance.

Try Now

Capabilities

Image Input

Example Use Cases

Multimodal visual perception

Image understanding with reasoning

Vision-language analysis

Technical Specifications

Context Window

32,768 tokens

Max Output

32,768 tokens

Cache Miss Cost

$0.15 per 1M tokens

Non-Reasoning Cost

$0.60 per 1M tokens

Cache Read Cost

$0.075 per 1M tokens

Web Search Cost

$15 per 1K calls

Code Execution Cost

$0.19 per 1K calls

⚠️ Legacy

Made legacy on

Reason

Untested

Recommended Replacement

Qwen3.5 Plus