Spotlight is a 7-billion-parameter vision-language model derived from Qwen 2.5-VL and fine-tuned by Arcee AI for tight image-text grounding tasks. It offers a 131,072-token context window, enabling rich multimodal conversations that combine lengthy documents with one or more images. Training emphasized fast inference on consumer GPUs while retaining strong accuracy on captioning, visual question answering, and diagram analysis. As a result, Spotlight slots neatly into agent workflows where screenshots, charts, or UI mock-ups must be interpreted on the fly. Early benchmarks show it matching or outscoring larger VLMs such as LLaVA-1.6 13B on popular VQA and POPE alignment tests.
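To illustrate the kind of multimodal request such a model accepts, the sketch below builds an OpenAI-style chat payload that pairs a text question with an image URL. The model id, endpoint schema, and message format are assumptions typical of OpenAI-compatible vision APIs, not details confirmed by this page.

```python
import json

# Sketch: an OpenAI-style multimodal chat payload for a VLM such as
# Spotlight. The model id "arcee-ai/spotlight" and the message schema
# are assumptions, not taken from this page.
def build_vqa_request(question: str, image_url: str,
                      model: str = "arcee-ai/spotlight") -> dict:
    """Pair a text question with an image in a single user message."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 512,
    }

payload = build_vqa_request("What does this chart show?",
                            "https://example.com/chart.png")
print(json.dumps(payload, indent=2))
```

The same payload shape works for screenshot or diagram questions by swapping the image URL and prompt.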
Lightweight image understanding
Screenshot or diagram interpretation
Visual question answering on a budget
Context window: 131,072 tokens
Max output: 65,537 tokens
Input: $0.18 per 1M tokens
Output: $0.18 per 1M tokens
$15 per 1K calls
$0.19 per 1K calls
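As a quick sanity check on the per-token rates above, here is a minimal cost estimator, assuming the two $0.18-per-1M figures are the input and output token rates (an assumption based on the page layout):

```python
# Sketch: estimating per-request cost from the listed token rates.
# Assumes $0.18 per 1M tokens for both input and output tokens.
INPUT_RATE = 0.18 / 1_000_000   # dollars per input token
OUTPUT_RATE = 0.18 / 1_000_000  # dollars per output token

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the assumed rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a 10,000-token prompt with a 1,000-token reply:
print(f"${estimate_cost(10_000, 1_000):.6f}")  # → $0.001980
```

At these rates, even a full 131,072-token context costs only a few cents per request.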