AI Models

Explore our comprehensive collection of AI models from leading providers. Find the perfect model for your needs.

aion-labs
Aion 1.0
Aion-1.0 is a multi-model system designed for high performance across various tasks, including reasoning and coding. It is built on DeepSeek-R1, augmented with additional models and techniques such as Tree of Thoughts (ToT) and Mixture of Experts (MoE). It is Aion Labs' most powerful reasoning model.
aion-labs
Aion 1.0 Mini
Aion-1.0-Mini is a 32B-parameter model distilled from DeepSeek-R1, designed for strong performance in reasoning domains such as mathematics, coding, and logic. It is a modified variant of a FuseAI model that outperforms R1-Distill-Qwen-32B and R1-Distill-Llama-70B, with benchmark results independently replicated for verification.
aion-labs
Aion 2.0
Aion-2.0 is a variant of DeepSeek V3.2 optimized for immersive roleplaying and storytelling. It is particularly strong at introducing tension, crises, and conflict into stories, making narratives feel more engaging. It also handles mature and darker themes with more nuance and depth.
aion-labs
Aion RP 1.0 8B
Aion-RP-Llama-3.1-8B ranks the highest in the character evaluation portion of the RPBench-Auto benchmark, a roleplaying-specific variant of Arena-Hard-Auto, where LLMs evaluate each other’s responses. It is a fine-tuned base model rather than an instruct model, designed to produce more natural and varied writing.
cohere
Aya Expanse 32B
A highly performant 32B multilingual model designed to rival monolingual performance through innovations in instruction tuning with data arbitrage, preference training, and model merging. Serves 23 languages including Arabic, Chinese, Japanese, Korean, and major European languages. With 128K context window, it handles substantial multilingual workloads effectively.
cohere
Aya Expanse 8B
A compact 8B multilingual model designed to rival monolingual performance through innovations in instruction tuning with data arbitrage, preference training, and model merging. Serves 23 languages with fast response times and low latency. Ideal for high-throughput multilingual workloads where cost and speed matter.
cohere
Aya Vision 32B
A state-of-the-art 32B multimodal model excelling at a variety of critical benchmarks for language, text, and image capabilities. Serves 23 languages with full image understanding, allowing you to pass in images and text and get a single coherent response. Focused on state-of-the-art multilingual performance.
cohere
Aya Vision 8B
A compact 8B multimodal model excelling at a variety of critical benchmarks for language, text, and image capabilities. Focused on low latency and best-in-class performance with image understanding across multiple languages.
anthropic
Claude 1.0
Anthropic's first publicly available large language model. Claude 1.0 offered basic text generation and reasoning with a 9K context window. This model has been retired and is no longer available for use.
anthropic
Claude 1.1
An incremental improvement over Claude 1.0 with better instruction following and reduced harmful outputs. 9K context window. This model has been retired and is no longer available for use.
anthropic
Claude 1.2
A refined version of Claude 1 with improved helpfulness and safety characteristics. 9K context window. This model has been retired and is no longer available for use.
anthropic
Claude 1.3
The final Claude 1 release, introducing the breakthrough 100K token context window. Claude 1.3 offered significantly improved long-document understanding and reasoning over earlier versions. This model has been retired and is no longer available for use.
anthropic
Claude 2.0
Anthropic's second-generation large language model with 100K context window. Claude 2.0 offered improved reasoning, coding, and math capabilities over Claude 1. This model has been retired and is no longer available for use.
anthropic
Claude 2.1
Anthropic's improved Claude 2 with doubled 200K context window and reduced hallucination rates. Claude 2.1 introduced beta tool use capabilities. This model has been retired and is no longer available for use.
anthropic
Claude Haiku 3
The original fast and affordable Claude 3 model. Haiku 3 delivers rapid responses at rock-bottom pricing, processing 21K tokens per second for prompts under 32K tokens. With 200K context, vision capabilities, and tool use support, it handles basic tasks reliably. Limited to 4K output tokens. Superseded by Haiku 4.5, which offers dramatically better performance at a modest price increase.
anthropic
Claude Haiku 3.5
Anthropic's fast and cost-efficient model from October 2024. Haiku 3.5 delivered near-Claude-3-Opus-level performance at budget pricing with 200K context and 8K output tokens. Scored 88.1% on HumanEval and 40.6% on SWE-bench Verified. This model has been retired and is no longer available for use.
anthropic
Claude Haiku 4.5
Fast, efficient, and surprisingly capable. Haiku delivers near-flagship performance at budget-friendly pricing. This model excels at coding, agent workflows, and computer use tasks with Claude's characteristic helpfulness and safety. With 200K context and support for files and images, it's perfect for production deployments, sub-agent systems, and any scenario where you need reliable intelligence without the premium cost. Lightning-fast response times make it ideal for real-time applications.
anthropic
Claude Instant 1.0
Anthropic's first fast and affordable model. Claude Instant 1.0 offered quick response times for simple tasks with a 9K context window. This model has been retired and is no longer available for use.
anthropic
Claude Instant 1.1
An improved version of Claude Instant with expanded 100K context window. Claude Instant 1.1 offered faster response times for classification, summarization, and text generation tasks. This model has been retired and is no longer available for use.
anthropic
Claude Instant 1.2
Anthropic's final and best Instant model, designed for high-throughput tasks. Claude Instant 1.2 offered 100K context with quick response times for simple classification, summarization, and text generation tasks. This model has been retired and is no longer available for use.
anthropic
Claude Opus 3
The original Claude 3 flagship model, once the most intelligent Claude available. Opus 3 excelled at complex reasoning, nuanced analysis, and creative tasks. With 200K context and vision capabilities but limited to 4K output tokens. This model has been retired and is no longer available for use.
anthropic
Claude Opus 4
The first Claude 4 flagship model, delivering breakthrough coding and agentic performance with 72.5% on SWE-bench Verified. Opus 4 excels at autonomous research, multi-step reasoning, and complex problem-solving with 200K context and extended thinking capabilities. Superseded by Opus 4.1 and later generations.
anthropic
Claude Opus 4.1
An earlier Opus generation delivering superior coding and agentic performance with 74.5% on SWE-bench Verified. Excellent for complex multi-step problems requiring rigor and precision.
anthropic
Claude Opus 4.5
The previous Opus generation for the hardest problems. Opus 4.5 excels at extremely complex reasoning, advanced coding challenges, sophisticated research, and intricate multi-step planning. With enhanced general intelligence and vision capabilities, this model tackles problems that push the boundaries of AI capability.
anthropic
Claude Opus 4.6
The world's best model for coding and professional work, built to power agents that take on whole categories of real-world work. Opus 4.6 excels across the entire SDLC, breaking through on hard problems, identifying complex bugs, and demonstrating deeper codebase understanding. It also delivers a step-change in knowledge work, with near-production-ready documents, presentations, and spreadsheets on the first pass. With 1M context window and 128K max output, this is the ultimate problem-solver.
anthropic
Claude Sonnet 3
The original Claude 3 balanced model offering a good combination of intelligence and speed. Sonnet 3 handled coding, analysis, and general tasks with 200K context and vision support. Limited to 4K output tokens. This model has been retired and is no longer available for use.
anthropic
Claude Sonnet 3.5
Claude's breakthrough mid-tier model from 2024, setting new standards for intelligence at its price point. The original June 2024 release and October 2024 update both delivered exceptional coding, analysis, and multimodal capabilities with 200K context and 8K output. The v2 update added computer use and PDF support. Both versions have been retired and are no longer available for use.
anthropic
Claude Sonnet 3.7
The first hybrid reasoning model from Anthropic, combining standard responses with extended thinking mode. Sonnet 3.7 introduced transparent step-by-step reasoning and excelled at coding and front-end development. With 200K context, 64K output, and thinking capabilities, it bridged the gap between Claude 3.5 and Claude 4. This model has been retired and is no longer available for use.
anthropic
Claude Sonnet 4
The first Claude 4 Sonnet model, offering a strong balance of intelligence and speed with extended thinking capabilities. Sonnet 4 handles complex coding, analysis, and reasoning tasks with 1M context window (beta) and 64K output. Superseded by Sonnet 4.5, which brings improved nuance and performance.
anthropic
Claude Sonnet 4.5
Our go-to model for sophisticated work that demands both intelligence and nuance. Sonnet 4.5 excels at complex coding projects, nuanced writing, detailed analysis, and thoughtful problem-solving. With thinking capabilities, multimodal support, and Claude's renowned ability to follow instructions precisely, this model strikes the perfect balance between capability and cost for professional-grade work. Ideal for when quality truly matters.
anthropic
Claude Sonnet 4.6
The latest Claude Sonnet model, building on 4.5's strengths with continued improvements. Sonnet 4.6 delivers excellent performance on complex coding projects, nuanced writing, and detailed analysis. Features thinking capabilities, multimodal support, and precise instruction following. A strong balance of capability and cost for professional-grade work.
alfredpros
CodeLLaMa 7B Instruct Solidity
A 7-billion-parameter CodeLlama-Instruct model fine-tuned to generate Solidity smart contracts, trained with 4-bit QLoRA via the PEFT library.
arcee
Coder Large
Coder-Large is a 32B-parameter offspring of Qwen 2.5-Instruct that has been further trained on permissively licensed GitHub, CodeSearchNet, and synthetic bug-fix corpora. It supports a 32K context window, enabling multi-file refactoring or long diff review in a single call, and understands 30-plus programming languages with special attention to TypeScript, Go, and Terraform. Internal benchmarks show 5-8 point gains over CodeLlama-34B-Python on HumanEval and competitive BugFix scores thanks to a reinforcement pass that rewards compilable output. The model emits structured explanations alongside code blocks by default, making it suitable for educational tooling as well as production copilot scenarios. Cost-wise, Together AI prices it well below proprietary incumbents, so teams can scale interactive coding without runaway spend.
mistral
Codestral
Our cutting-edge language model for coding, released August 2025.
deepcogito
Cogito v2.1 671B
Cogito v2.1 671B MoE represents one of the strongest open models globally, matching the performance of frontier closed and open models. This model is trained using self-play with reinforcement learning to reach state-of-the-art performance across multiple categories (instruction following, coding, longer queries, and creative writing). This advanced system demonstrates significant progress toward scalable superintelligence through policy improvement.
cohere
Command
An instruction-following conversational model that performs language tasks with high quality and a 4K context window. This model has been deprecated by Cohere as of September 2025.
cohere
Command A
Cohere's most performant model, excelling at tool use, agents, retrieval augmented generation (RAG), and multilingual use cases. Command A has a 256K context window, only requires two GPUs to run, and has 150% higher throughput compared to Command R+ 08-2024. Ideal for enterprise deployments requiring strong tool integration and multilingual support.
cohere
Command A Reasoning
Cohere's first reasoning model, able to think before generating an output in a way that allows it to perform well in certain kinds of nuanced problem-solving and agent-based tasks in 23 languages. With a 256K context window and 32K max output, it excels at complex analytical work.
cohere
Command A Translate
Cohere's state-of-the-art machine translation model, excelling at a variety of translation tasks on 23 languages including English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Chinese, Arabic, Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian.
cohere
Command A Vision
Cohere's first model capable of processing images, excelling in enterprise use cases such as analyzing charts, graphs, and diagrams, table understanding, OCR, document Q&A, and object detection. Officially supports English, Portuguese, Italian, French, German, and Spanish with a 128K context window.
cohere
Command Light
A smaller, faster version of Command. Almost as capable but with much lower latency and cost. 4K context window. This model has been deprecated by Cohere as of September 2025.
cohere
Command R
An instruction-following conversational model that performs language tasks at a higher quality, more reliably, and with a longer context than previous models. Best suited for complex workflows like code generation, retrieval augmented generation (RAG), tool use, and agents. 128K context window with 4K max output.
cohere
Command R+
An instruction-following conversational model that performs language tasks at a higher quality, more reliably, and with a longer context than previous models. Best suited for complex RAG workflows and multi-step tool use. 128K context window with 4K max output.
cohere
Command R7B
A small, fast model that excels at RAG, tool use, agents, and similar tasks requiring complex reasoning and multiple steps. With 128K context window and extremely low pricing, it is ideal for high-throughput production workloads where cost and speed are paramount.
thedrummer
Cydonia 24B V4.1
An uncensored creative-writing model based on Mistral Small 3.2 24B, with good recall, prompt adherence, and intelligence.
nousresearch
DeepHermes 3 Mistral 24B
DeepHermes 3 (Mistral 24B Preview) is an instruction-tuned language model by Nous Research based on Mistral-Small-24B, designed for chat, function calling, and advanced multi-turn reasoning. It introduces a dual-mode system that toggles between fast, intuitive chat responses and a structured "deep reasoning" mode, controlled by special system prompts. Fine-tuned via distillation from R1, it supports structured output (JSON mode) and function-call syntax for agent-based applications. When activated with the specific system instruction, the model enters deep thinking mode, generating extended chains of thought wrapped in `<think></think>` tags before delivering a final answer.
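Because deep-reasoning completions wrap the chain of thought in `<think></think>` tags, client code typically separates that block from the final answer before display. A minimal sketch of such a parser (the sample completion text is invented for illustration; the tag format is as described above):

```python
import re

def split_think(completion: str) -> tuple[str, str]:
    """Separate a <think>...</think> reasoning block from the final answer.

    Returns (reasoning, answer); reasoning is empty when no think block
    is present (i.e., the model replied in standard chat mode).
    """
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if not match:
        return "", completion.strip()
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer

# Hypothetical deep-reasoning output:
sample = "<think>2 + 2 is elementary addition.</think>\nThe answer is 4."
reasoning, answer = split_think(sample)
print(answer)  # -> The answer is 4.
```

The non-greedy match with `re.DOTALL` keeps the split correct even when the reasoning spans multiple lines.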
deepseek
DeepSeek R1
DeepSeek's advanced reasoning model from the R1 series, released May 2025. Built for deep chain-of-thought reasoning, it excels at complex mathematical, logical, and analytical problems. With 131K context and 32K output window, it handles substantial reasoning tasks. Superseded by the V3.2 Speciale model, which achieves stronger reasoning at lower cost.
tngtech
DeepSeek R1T Chimera
DeepSeek-R1T-Chimera is created by merging DeepSeek-R1 and DeepSeek-V3 (0324), combining the reasoning capabilities of R1 with the token efficiency improvements of V3. It is based on a DeepSeek-MoE Transformer architecture and is optimized for general text generation tasks. The model merges pretrained weights from both source models to balance performance across reasoning, efficiency, and instruction-following tasks. It is released under the MIT license and intended for research and commercial use.
tngtech
DeepSeek R1T2 Chimera
DeepSeek-TNG-R1T2-Chimera is the second-generation Chimera model from TNG Tech. It is a 671B-parameter mixture-of-experts text-generation model assembled from DeepSeek-AI's R1-0528, R1, and V3-0324 checkpoints with an Assembly-of-Experts merge. The tri-parent design yields strong reasoning performance while running roughly 20% faster than the original R1 and more than 2× faster than R1-0528 under vLLM, giving a favorable cost-to-intelligence trade-off. The checkpoint supports contexts up to 60K tokens in standard use (tested to ~130K) and maintains consistent <think> token behavior, making it suitable for long-context analysis, dialogue, and other open-ended generation tasks.
deepseek
DeepSeek V3
DeepSeek's updated V3 model released on March 24, 2025. A reliable general-purpose model with 131K context window and tool support. Offers excellent cost-effectiveness for standard chat and completion tasks. Superseded by V3.1 and V3.2 series with improved capabilities.
deepseek
DeepSeek V3.1
DeepSeek's powerful open-source model with thinking capabilities and tool support. With 131K context and cache-based pricing, it offers strong performance for general reasoning, coding, and analysis tasks. Superseded by the V3.2 series with expanded context window and improved efficiency.
nex-agi
DeepSeek V3.1 Nex N1
DeepSeek V3.1 Nex-N1 is the flagship release of the Nex-N1 series — a post-trained model designed to highlight agent autonomy, tool use, and real-world productivity. Nex-N1 demonstrates competitive performance across all evaluation scenarios, showing particularly strong results in practical coding and HTML generation tasks.
deepseek
DeepSeek V3.1 Terminus
The production-optimized variant of DeepSeek V3.1, tuned for stable and consistent output quality. Shares the same pricing and context capabilities as V3.1 with improved deployment characteristics. Superseded by the V3.2 series.
deepseek
DeepSeek V3.2
An innovative model pioneering breakthrough efficiency in long-context processing. Using revolutionary Sparse Attention technology, this model handles massive contexts (128K tokens) with exceptional speed and minimal resource use while maintaining quality. With thinking capabilities and an impressive 65K output window, it excels at tasks requiring extensive context understanding. Perfect for processing large documents, codebases, or datasets where traditional models slow down. Exceptional value for long-context work.
deepseek
DeepSeek V3.2 Experimental
An innovative experimental model pioneering breakthrough efficiency in long-context processing. Using revolutionary Sparse Attention technology, this model handles massive contexts (164K tokens) with exceptional speed and minimal resource use while maintaining quality. With thinking capabilities and an impressive 65K output window, it excels at tasks requiring extensive context understanding. Perfect for processing large documents, codebases, or datasets where traditional models slow down. Exceptional value for long-context work.
deepseek
DeepSeek V3.2 Experimental Thinking
The thinking-optimized variant of our experimental long-context model, designed for deep reasoning over extensive information. This model allocates more computational resources to explicit reasoning while processing massive contexts efficiently. With 164K context support, it excels at analytical tasks requiring both breadth of information and depth of thought. Choose this when you need thorough, reasoned analysis of large amounts of information. Extremely cost-effective for research and analysis.
deepseek
DeepSeek V3.2 Speciale
The most powerful reasoning model in the DeepSeek lineup, pushing the absolute boundaries of AI reasoning capabilities. V3.2-Speciale achieves gold-medal performance in IMO, CMO, ICPC World Finals, and IOI 2025, rivaling Gemini-3.0-Pro on complex tasks. This model excels at the hardest mathematical, logical, and competitive programming challenges where no other model can compete. Note: Optimized purely for reasoning—does not support tool calls and consumes more tokens than standard models. Choose this when you need world-class reasoning on the most demanding problems.
deepseek
DeepSeek V3.2 Thinking
The thinking-optimized variant of our long-context model, designed for deep reasoning over extensive information. This model allocates more computational resources to explicit reasoning while processing massive contexts efficiently. With 128K context support, it excels at analytical tasks requiring both breadth of information and depth of thought. Choose this when you need thorough, reasoned analysis of large amounts of information. Extremely cost-effective for research and analysis.
mistral
Devstral
Mistral AI's official devstral-2512 model.
mistral
Devstral Medium
Our medium code-agentic model.
mistral
Devstral Small
Our small open-source code-agentic model.
baidu
ERNIE 4.5 21B A3B
A sophisticated text-based Mixture-of-Experts (MoE) model featuring 21B total parameters with 3B activated per token, delivering exceptional language understanding and generation through its heterogeneous MoE structure. Supporting an extensive 131K token context length, the model achieves efficient inference via multi-expert parallel collaboration and quantization, while advanced post-training techniques including SFT, DPO, and UPO ensure optimized performance across diverse applications, with specialized routing and balancing losses for superior task handling.
baidu
ERNIE 4.5 21B A3B Thinking
ERNIE-4.5-21B-A3B-Thinking is Baidu's upgraded lightweight MoE model, refined to boost reasoning depth and quality for top-tier performance in logical puzzles, math, science, coding, text generation, and expert-level academic benchmarks.
baidu
ERNIE 4.5 300B A47B
ERNIE-4.5-300B-A47B is a 300B-parameter Mixture-of-Experts (MoE) language model developed by Baidu as part of the ERNIE 4.5 series. It activates 47B parameters per token and supports text generation in both English and Chinese. Optimized for high-throughput inference and efficient scaling, it uses a heterogeneous MoE structure with advanced routing and quantization strategies, including FP8 and 2-bit formats. This version is fine-tuned for language-only tasks and supports reasoning, tool parameters, and extended context lengths up to 131K tokens. Suitable for general-purpose LLM applications with high reasoning and throughput demands.
baidu
ERNIE 4.5 VL 28B A3B
A powerful multimodal Mixture-of-Experts chat model featuring 28B total parameters with 3B activated per token, delivering exceptional text and vision understanding through its innovative heterogeneous MoE structure with modality-isolated routing. Built with scaling-efficient infrastructure for high-throughput training and inference, the model leverages advanced post-training techniques including SFT, DPO, and UPO for optimized performance, while supporting an impressive 131K context length and RLVR alignment for superior cross-modal reasoning and generation capabilities.
baidu
ERNIE 4.5 VL 424B A47B
ERNIE-4.5-VL-424B-A47B is a multimodal Mixture-of-Experts (MoE) model from Baidu’s ERNIE 4.5 series, featuring 424B total parameters with 47B active per token. It is trained jointly on text and image data using a heterogeneous MoE architecture and modality-isolated routing to enable high-fidelity cross-modal reasoning, image understanding, and long-context generation (up to 131K tokens). Fine-tuned with techniques like SFT, DPO, UPO, and RLVR, this model supports both “thinking” and non-thinking inference modes. Designed for vision-language tasks in English and Chinese, it is optimized for efficient scaling and can operate under 4-bit/8-bit quantization.
google
Gemini 2.0 Flash
A fast and cost-effective model with native tool use, code execution, and web search grounding. Supports a 1M token context window with multimodal inputs including text, images, audio, and video. Experimental thinking support enables configurable reasoning. Ideal for high-volume tasks that need broad capability at minimal cost.
google
Gemini 2.0 Flash Lite
The most affordable Gemini model, optimized for cost efficiency and low latency. Supports a 1M token context window with multimodal inputs and function calling. No native code execution, web search, or thinking support. Best suited for high-volume simple tasks where cost is the primary concern.
google
Gemini 2.5 Flash
The perfect all-rounder combining intelligence, speed, and value. This thinking model delivers excellent performance across diverse tasks with a massive 1M context window and full multimodal support. Whether you need code generation, document analysis, visual understanding, or complex reasoning, Flash handles it with grace. The sweet spot between capability and cost makes it our recommended choice for most professional work. With strong thinking capabilities, it provides both quality and transparency.
google
Gemini 2.5 Flash Image (Nano Banana)
The first hybrid reasoning image generator combining speed, intelligence, and creative control. Nano Banana creates images from text, edits them conversationally across multiple turns, and generates interleaved text-and-image responses. With configurable thinking budgets, you control the balance between quality, cost, and speed. Locale-aware generation ensures culturally appropriate visuals for global audiences. Perfect for rapid creative iteration, conversational image editing, and projects requiring both visual and textual content together. Fast, flexible, and surprisingly capable.
google
Gemini 2.5 Flash Lite
Blazingly fast and incredibly affordable, without sacrificing capability. This lightweight model offers an extraordinary 1M token context window with multimodal support at breakthrough pricing. With configurable thinking and tool connectivity, it handles diverse tasks from quick queries to complex document analysis. The massive context window means you can process entire books, codebases, or datasets in a single request. Perfect for high-volume applications and cost-conscious projects that still need quality results.
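Several Gemini entries above mention configurable thinking. As a hedged sketch of what that looks like in practice: the public Gemini REST API exposes a per-request thinking budget via `generationConfig.thinkingConfig.thinkingBudget`, and a request body can be assembled locally before sending (field names assumed from the public API; no network call is made here):

```python
import json

def build_gemini_request(prompt: str, thinking_budget: int) -> dict:
    """Build a generateContent request body with an explicit thinking budget.

    thinkingBudget caps the tokens spent on internal reasoning, trading
    quality against cost and latency; supported values vary by model.
    """
    return {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "generationConfig": {
            "thinkingConfig": {"thinkingBudget": thinking_budget},
        },
    }

# Example: a low-budget request suited to a quick, high-volume query.
body = build_gemini_request("Summarize this paragraph.", thinking_budget=512)
print(json.dumps(body, indent=2))
```

The resulting dict would be POSTed to the model's `generateContent` endpoint; larger budgets generally buy deeper reasoning at higher cost.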
google
Gemini 2.5 Pro
The previous-generation advanced Gemini reasoning model capable of solving complex problems with 1M context and comprehensive multimodal support including audio and video.
google
Gemini 3 Flash
Google's most intelligent model balanced for speed and cost, combining frontier intelligence with superior search and grounding. Gemini 3 Flash delivers exceptional reasoning capabilities across a massive 1M context window while maintaining fast response times. With full multimodal support including vision and tool use, it excels at complex analytical tasks, research, and code generation. The perfect choice when you need top-tier intelligence and speed, or when cost is a consideration.
google
Gemini 3 Pro
The most advanced Gemini model, pushing the boundaries of multimodal reasoning and complex problem-solving. This preview model excels at sophisticated analytical tasks with support for text, images, audio, video, and documents. With a 1M context window and enhanced reasoning capabilities, it tackles problems that require deep understanding across multiple modalities. Choose this for cutting-edge multimodal work, advanced research, or when you need the absolute best in visual and analytical reasoning. The future of multimodal AI.
google
Gemini 3 Pro Image (Nano Banana Pro)
Professional-grade image generation delivering studio-quality, production-ready visuals with unparalleled precision and control. Building on Nano Banana's foundation, the Pro version adds enhanced reasoning, deep world knowledge, sophisticated text rendering and translation within images, and studio-level fine controls. Create high-fidelity visuals with accurate text, cultural nuance, and functional design precision. Perfect for professional projects, marketing materials, product designs, and any work requiring publication-ready quality. The ultimate image generation model for serious creative work.
google
Gemini 3.1 Flash Image (Nano Banana 2)
Gemini 3.1 Flash Image, a.k.a. "Nano Banana 2," is Google's latest state-of-the-art image generation and editing model, delivering Pro-level visual quality at Flash speed. It combines advanced contextual understanding with fast, cost-efficient inference, making complex image generation and iterative edits significantly more accessible.
google
Gemini 3.1 Flash Lite
Gemini 3.1 Flash Lite is Google's high-efficiency model optimized for high-volume use cases. It outperforms Gemini 2.5 Flash Lite on overall quality and approaches Gemini 2.5 Flash performance across key capabilities. Improvements span audio input/ASR, RAG snippet ranking, translation, data extraction, and code completion. Supports full thinking levels (minimal, low, medium, high) for fine-grained cost/performance trade-offs. Priced at half the cost of Gemini 3 Flash.
google
Gemini 3.1 Pro
The most advanced Gemini model with significantly improved reasoning, SWE and agentic capabilities. Building on Gemini 3 Pro, this model delivers better token efficiency, expanded thinking levels, and stronger performance on complex problem-solving benchmarks. With a 1M context window and full multimodal support, it excels at ambitious agentic workflows, coding, multi-step function calling, planning, and deep knowledge tasks. Choose this for the most demanding analytical, research, and engineering challenges.
google
Gemma 2 27B
Gemma 2 27B by Google is an open model built from the same research and technology used to create the Gemini models. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Usage of Gemma is subject to Google's Gemma Terms of Use.
google
Gemma 2 9B
Gemma 2 9B by Google is an advanced, open-source language model that sets a new standard for efficiency and performance in its size class. Designed for a wide variety of tasks, it empowers developers and researchers to build innovative applications, while maintaining accessibility, safety, and cost-effectiveness. Usage of Gemma is subject to Google's Gemma Terms of Use.
google
Gemma 3 12B
Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128K tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 12B is the second largest in the family of Gemma 3 models after Gemma 3 27B.
google
Gemma 3 27B
Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128K tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 27B is Google's latest open-source model, successor to Gemma 2.
google
Gemma 3 4B
Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128K tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling.
google
Gemma 3n 2B
Gemma 3n E2B IT is a multimodal, instruction-tuned model developed by Google DeepMind, designed to operate efficiently at an effective parameter size of 2B while leveraging a 6B architecture. Based on the MatFormer architecture, it supports nested submodels and modular composition via the Mix-and-Match framework. Gemma 3n models are optimized for low-resource deployment, offering 32K context length and strong multilingual and reasoning performance across common benchmarks. This variant is trained on a diverse corpus including code, math, web, and multimodal data.
google
Gemma 3n 4B
Gemma 3n E4B-it is optimized for efficient execution on mobile and low-resource devices, such as phones, laptops, and tablets. It supports multimodal inputs—including text, visual data, and audio—enabling diverse tasks such as text generation, speech recognition, translation, and image analysis. Leveraging innovations like Per-Layer Embedding (PLE) caching and the MatFormer architecture, Gemma 3n dynamically manages memory usage and computational load by selectively activating model parameters, significantly reducing runtime resource requirements. This model supports a wide linguistic range (trained in over 140 languages) and features a flexible 32K token context window. Gemma 3n can selectively load parameters, optimizing memory and computational efficiency based on the task or device capabilities, making it well-suited for privacy-focused, offline-capable applications and on-device AI solutions.
zai
GLM 4 32B
A highly cost-effective 32B foundation model with enhanced capabilities in tool use, online search, and code-related intelligent tasks. Pre-trained on 15T of high-quality data including abundant synthetic reasoning data, it performs comparably to much larger models on many benchmarks. At just $0.1 per million tokens for both input and output, it delivers exceptional value for production workloads requiring tool invocation, information extraction, and code generation.
zai
GLM 4.5
The most powerful GLM reasoning model with 355B total parameters and 32B active per forward pass using Mixture-of-Experts architecture. GLM-4.5 ranks second globally among all models on aggregated benchmarks, first among domestic and open-source models. Purpose-built for agent-oriented applications, it excels at tool invocation, web browsing, software engineering, and front-end development. Supports hybrid reasoning modes for both complex thinking and instant responses.
zai
GLM 4.5 Air
A streamlined, efficient agent-focused model using Mixture-of-Experts architecture. With 106B total parameters but only 12B active per task, this model delivers impressive intelligence while remaining fast and cost-effective. Purpose-built for agentic applications, it excels at tool use and autonomous workflows. The thinking capabilities provide transparency in decision-making. With 128K context and 96K output, it handles substantial tasks comfortably. Perfect for production agent systems where you need reliability and efficiency without breaking the budget.
zai
GLM 4.5 AirX
The high-speed variant of GLM-4.5-Air, delivering ultra-fast response times while maintaining strong performance. With 106B total parameters and 12B active per forward pass, it combines the efficiency of the Air architecture with optimized inference speed exceeding 100 tokens per second. Ideal for low-latency production deployments where speed matters alongside intelligent agent capabilities.
zai
GLM 4.5 Flash
A completely free GLM model with strong reasoning, coding, and agent capabilities. Despite being free, it delivers impressive performance suitable for a wide range of tasks including development workflows, agent applications, and general reasoning. With 200K context and thinking support, it provides substantial capability at zero cost — perfect for experimentation, prototyping, and budget-sensitive production use.
zai
GLM 4.5 X
The premium high-speed variant of GLM-4.5, delivering the full reasoning power of the flagship 355B MoE model with ultra-fast inference. Optimized for scenarios requiring both strong reasoning capabilities and rapid response times, it provides the best of both worlds for demanding production workloads. Ideal for interactive agent applications and real-time coding assistance where latency is critical.
zai
GLM 4.5V
A visual reasoning model based on the MoE architecture with 106B total parameters and 12B active. Achieves state-of-the-art performance among open-source VLMs of its scale across image, video, document understanding, and GUI tasks. Features a flexible thinking mode toggle for balancing speed and reasoning depth. Excels at webpage code generation from screenshots, object detection, document parsing, and long video analysis.
zai
GLM 4.6
The latest and most capable GLM model with comprehensive improvements across all domains. This versatile model excels at real-world coding, handles long contexts up to 200K tokens, and delivers strong performance in reasoning, research, writing, and agentic workflows. With thinking capabilities and an impressive 96K output window, it tackles diverse professional tasks with confidence. The well-rounded upgrade brings enhanced capabilities across the board while maintaining excellent value. Choose this for sophisticated work requiring versatility and depth.
zai
GLM 4.6V
A capable multimodal model achieving state-of-the-art visual understanding among models of similar scale. GLM 4.6V combines strong image analysis with the reasoning and tool use capabilities of the GLM family. With 128K context support and vision capabilities, it handles image understanding, document analysis, and visual reasoning tasks effectively. An excellent choice for multimodal workflows where you need reliable visual comprehension without premium pricing.
zai
GLM 4.6V Flash
A completely free multimodal model with native function calling support from the GLM-4.6V series. Handles image, video, and document understanding at zero cost while supporting tool invocation for building multimodal agents. With 128K context, it provides substantial capability for visual understanding workflows without any API costs.
zai
GLM 4.6V FlashX
A lightweight, high-speed multimodal model from the GLM-4.6V series with native function calling and thinking mode support. Delivers fast visual understanding at a fraction of the cost of the flagship GLM-4.6V while maintaining strong capabilities across image, video, and document tasks. Ideal for production multimodal agents requiring low latency and affordable pricing.
zai
GLM 4.7
The latest and most capable GLM model with comprehensive improvements across all domains. This versatile model excels at real-world coding, handles long contexts up to 205K tokens, and delivers strong performance in reasoning, research, writing, and agentic workflows. With thinking capabilities and an impressive 131K output window, it tackles diverse professional tasks with confidence. The well-rounded upgrade brings enhanced capabilities across the board while maintaining excellent value. Choose this for sophisticated work requiring versatility and depth.
zai
GLM 4.7 Flash
A completely free model from the GLM-4.7 series that achieves open-source SOTA scores among comparable-sized models on SWE-bench Verified and agent benchmarks. Excels at both frontend and backend development, plus general tasks like writing, translation, and role-playing. With 200K context, thinking support, and zero cost, it provides exceptional value for development workflows and agent applications.
zai
GLM 4.7 FlashX
A lightweight, high-speed variant of GLM-4.7 delivering enhanced general capabilities and optimized agentic coding at a fraction of the cost. With 200K context, thinking support, and rapid inference, it balances strong programming ability with affordability. Ideal for high-throughput development workflows and agent systems where speed and cost-efficiency matter.
zai
GLM 5
Zai's new-generation flagship foundation model designed for Agentic Engineering. GLM-5 delivers state-of-the-art open-source performance in coding and agent capabilities, with usability in real programming scenarios approaching Claude Opus 4.5. Built for complex system engineering and long-range agent tasks, it provides reliable productivity across demanding workflows. With 203K context, 131K output, thinking capabilities, and implicit caching, it excels at sophisticated agentic applications requiring depth and persistence.
alpindale
Goliath 120B
A large LLM created by combining two fine-tuned Llama 70B models into one 120B model. Combines Xwin and Euryale.
openai
GPT Audio
The GPT Audio model is OpenAI's first generally available audio model. It features an upgraded decoder for more natural-sounding voices and maintains better voice consistency.
openai
GPT Audio Mini
A cost-efficient version of GPT Audio. It features an upgraded decoder for more natural-sounding voices and maintains better voice consistency.
openai
GPT OSS 120B
GPT OSS 120B is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. It activates 5.1B parameters per forward pass and is optimized to run on a single H100 GPU with native MXFP4 quantization. The model supports configurable reasoning depth, full chain-of-thought access, and native tool use, including function calling, browsing, and structured output generation.
openai
GPT OSS 20B
GPT OSS 20B is an open-weight 21B parameter model released by OpenAI under the Apache 2.0 license. It uses a Mixture-of-Experts (MoE) architecture with 3.6B active parameters per forward pass, optimized for lower-latency inference and deployability on consumer or single-GPU hardware.
openai
GPT OSS Safeguard 20B
GPT OSS Safeguard 20B is a safety reasoning model from OpenAI built upon GPT OSS 20B. This open-weight, 21B-parameter Mixture-of-Experts (MoE) model offers lower latency for safety tasks like content classification, LLM filtering, and trust and safety labeling.
openai
GPT-3.5 Turbo
GPT-3.5 Turbo is OpenAI's fastest model. It can understand and generate natural language or code, and is optimized for chat and traditional completion tasks.
openai
GPT-3.5 Turbo 16K
This model offers four times the context length of GPT-3.5 Turbo, allowing it to support approximately 20 pages of text in a single request at a higher cost.
openai
GPT-3.5 Turbo Instruct
This model is a variant of GPT-3.5 Turbo tuned for instructional prompts, omitting chat-related optimizations.
openai
GPT-4
OpenAI's GPT-4 is a large-scale multimodal language model capable of solving difficult problems with greater accuracy than previous models due to its broader general knowledge and advanced reasoning capabilities.
openai
GPT-4 Turbo
The latest GPT-4 Turbo model with vision capabilities. Vision requests can now use JSON mode and function calling. Training data: up to December 2023.
openai
GPT-4.1
GPT-4.1 is a flagship large language model optimized for advanced instruction following, real-world software engineering, and long-context reasoning. It supports a 1 million token context window and outperforms GPT-4o and GPT-4.5 across coding, instruction compliance, and multimodal understanding benchmarks.
openai
GPT-4.1 Mini
GPT-4.1 Mini is a mid-sized model delivering performance competitive with GPT-4o at substantially lower latency and cost. It retains a 1 million token context window and shows strong coding ability and vision understanding.
openai
GPT-4.1 Nano
For tasks that demand low latency, GPT-4.1 Nano is the fastest and cheapest model in the GPT-4.1 series. It delivers exceptional performance at a small size with its 1 million token context window. It's ideal for tasks like classification or autocompletion.
openai
GPT-4o
GPT-4o is OpenAI's multimodal AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of GPT-4 Turbo while being twice as fast and 50% more cost-effective.
openai
GPT-4o Audio
The GPT-4o Audio model adds support for audio inputs as prompts. This enhancement allows the model to detect nuances within audio recordings and add depth to generated user experiences.
openai
GPT-4o Mini
GPT-4o Mini is OpenAI's advanced small model, many times more affordable than other recent frontier models. It maintains SOTA intelligence while being significantly more cost-effective.
openai
GPT-4o Mini Search
GPT-4o Mini Search is a specialized model for web search in Chat Completions. It is trained to understand and execute web search queries.
openai
GPT-4o Search
GPT-4o Search is a specialized model for web search in Chat Completions. It is trained to understand and execute web search queries.
openai
GPT-5
OpenAI's original GPT-5 flagship excelling at complex reasoning, broad knowledge, advanced coding, and multi-step agentic tasks.
openai
GPT-5 Chat
GPT-5 Chat is designed for advanced, natural, multimodal, and context-aware conversations for enterprise applications.
openai
GPT-5 Codex
GPT-5 Codex is a specialized version of GPT-5 optimized for software engineering and coding workflows. It is designed for both interactive development sessions and long, independent execution of complex engineering tasks. The model supports building projects from scratch, feature development, debugging, large-scale refactoring, and code review.
openai
GPT-5 Mini
The sweet spot for everyday AI work: intelligent, fast, and affordable. This model excels at reasoning, conversation, and general tasks with an optimal balance of capability and cost. With 400K context, multimodal support, and thinking capabilities, it handles most professional work confidently. The GPT-5 architecture delivers reliable quality across coding, writing, analysis, and problem-solving. Our most popular choice for teams who need consistent, high-quality performance without premium pricing. An excellent general-purpose workhorse.
openai
GPT-5 Nano
Lightning-fast and incredibly cost-effective for high-throughput workloads. This model specializes in straightforward instructions and classification tasks where speed is essential. With a massive 400K context window and multimodal support, it processes large volumes of simple tasks efficiently. The thinking capability is tuned for quick, decisive responses rather than deep contemplation. Perfect for production systems handling thousands of simple requests, real-time classification, or any scenario requiring fast, economical processing with GPT-5 architecture.
openai
GPT-5 Pro
The ultimate thinking machine for problems that demand maximum intelligence and computational effort. GPT-5 Pro allocates massive compute resources to think deeply and thoroughly about the hardest challenges. With an extraordinary 272K output window and extensive thinking capabilities, this model tackles problems other AIs simply cannot solve. Requests may take minutes to complete as it works through complex reasoning chains. For cutting-edge research, groundbreaking problem-solving, and situations where correctness is paramount and time is secondary.
openai
GPT-5.1
OpenAI's flagship model with adaptive thinking that allocates computational effort based on question complexity. This model excels at sophisticated reasoning, deep real-world knowledge, advanced coding challenges, and complex multi-step workflows. It intelligently spends more time on hard problems while responding quickly to simpler ones. Perfect for professional work requiring OpenAI's best capabilities across reasoning, knowledge, and technical execution.
openai
GPT-5.1 Chat
GPT-5.1 Chat is the fast, lightweight member of the 5.1 family, optimized for low-latency chat while retaining strong general intelligence. It uses adaptive reasoning to selectively think on harder queries, improving accuracy on math, coding, and multi-step tasks without slowing down typical conversations. The model is warmer and more conversational by default, with better instruction following and more stable short-form reasoning.
openai
GPT-5.1 Codex
GPT-5.1 Codex is a specialized version of GPT-5.1 optimized for software engineering and coding workflows. It is designed for both interactive development sessions and long, independent execution of complex engineering tasks. The model supports building projects from scratch, feature development, debugging, large-scale refactoring, and code review. Compared to GPT-5.1, Codex is more steerable, adheres closely to developer instructions, and produces cleaner, higher-quality code outputs.
openai
GPT-5.1 Codex Max
GPT-5.1 Codex Max is OpenAI's agentic coding model, designed for long-running, high-context software development tasks. It is based on an updated version of the 5.1 reasoning stack and trained on agentic workflows spanning software engineering, mathematics, and research. GPT-5.1 Codex Max delivers faster performance, improved reasoning, and higher token efficiency across the development lifecycle.
openai
GPT-5.1 Codex Mini
GPT-5.1 Codex Mini is a smaller and faster version of GPT-5.1 Codex.
openai
GPT-5.1 Instant
A conversational variant of GPT-5 with warmer tone, improved instruction following, and adaptive reasoning. Designed for purely conversational applications rather than research.
openai
GPT-5.1 Thinking
OpenAI's premier thinking model with precisely tuned adaptive reasoning. This upgraded version excels at complex analytical tasks, sophisticated coding, and multi-step problem-solving with transparent thought processes.
openai
GPT-5.2
OpenAI's best general-purpose model, part of the GPT-5 flagship model family. GPT-5.2 is their most intelligent model yet for both general and agentic tasks. With a 400K context window, multimodal capabilities including image generation, and advanced reasoning, this model excels at sophisticated coding, complex analysis, and multi-step workflows. The ideal choice for professional work requiring OpenAI's cutting-edge capabilities.
openai
GPT-5.2 Chat
GPT-5.2 Chat is the fast, lightweight member of the 5.2 family, optimized for low-latency chat while retaining strong general intelligence. It uses adaptive reasoning to selectively think on harder queries, improving accuracy on math, coding, and multi-step tasks without slowing down typical conversations. The model is warmer and more conversational by default, with better instruction following and more stable short-form reasoning.
openai
GPT-5.2 Codex
GPT-5.2 Codex is an upgraded version of GPT-5.1 Codex optimized for software engineering and coding workflows. It is designed for both interactive development sessions and long, independent execution of complex engineering tasks. The model supports building projects from scratch, feature development, debugging, large-scale refactoring, and code review. Compared to GPT-5.1 Codex, 5.2 Codex is more steerable, adheres closely to developer instructions, and produces cleaner, higher-quality code outputs. Codex integrates into developer environments including the CLI, IDE extensions, GitHub, and cloud tasks. It adapts reasoning effort dynamically, providing fast responses for small tasks while sustaining extended multi-hour runs for large projects.
openai
GPT-5.2 Pro
The ultimate version of GPT-5.2 that produces smarter and more precise responses. This model allocates massive compute resources to think deeply and thoroughly about the hardest challenges. With a 400K context window, multimodal capabilities including image generation, and maximum reasoning power, it tackles problems that require the highest quality thinking available. For cutting-edge research, groundbreaking problem-solving, and situations where precision and correctness are paramount.
openai
GPT-5.3 Chat
GPT-5.3 Chat is an update to ChatGPT's most-used model that makes everyday conversations smoother, more useful, and more directly helpful. It delivers more accurate answers with better contextualization and significantly reduces unnecessary refusals, caveats, and overly cautious phrasing that can interrupt conversational flow.
openai
GPT-5.3 Codex
GPT-5.3 Codex is OpenAI's most advanced agentic coding model, combining the frontier software engineering performance of GPT-5.2 Codex with the broader reasoning and professional knowledge capabilities of GPT-5.2. It achieves state-of-the-art results on SWE-Bench Pro and strong performance on Terminal-Bench 2.0 and OSWorld-Verified, reflecting improved multi-language coding, terminal proficiency, and real-world computer-use skills. The model is optimized for long-running, tool-using workflows and supports interactive steering during execution, making it suitable for complex development tasks, debugging, deployment, and iterative product work. Beyond coding, GPT-5.3 Codex performs strongly on structured knowledge-work benchmarks such as GDPval, supporting tasks like document drafting, spreadsheet analysis, slide creation, and operational research across domains.
openai
GPT-5.4
GPT-5.4 is OpenAI's latest frontier model, unifying the Codex and GPT lines into a single system. It features a 1M+ token context window (922K input, 128K output) with support for text and image inputs, enabling high-context reasoning, coding, and multimodal analysis within the same workflow. The model delivers improved performance in coding, document understanding, tool use, and instruction following. It is designed as a strong default for both general-purpose tasks and software engineering, capable of generating production-quality code, synthesizing information across multiple sources, and executing complex multi-step workflows with fewer iterations and greater token efficiency.
openai
GPT-5.4 Pro
GPT-5.4 Pro is OpenAI's most advanced model, building on GPT-5.4's unified architecture with enhanced reasoning capabilities for complex, high-stakes tasks. It features a 1M+ token context window (922K input, 128K output) with support for text and image inputs. Optimized for step-by-step reasoning, instruction following, and accuracy, GPT-5.4 Pro excels at agentic coding, long-context workflows, and multi-step problem solving.
ibm
Granite 4.0 Micro
Granite-4.0-H-Micro is a 3B-parameter model from the Granite 4 family, the latest series of models released by IBM. These models are fine-tuned for long-context tool calling.
xai
Grok 2 Vision
xAI's legacy vision model with 32K context supporting text and image inputs with function calling and structured outputs. Superseded by Grok 4.
xai
Grok 3
xAI's previous-generation flagship text model with 131K context, function calling, and structured output support. Superseded by the Grok 4 series.
xai
Grok 3 Mini
xAI's compact thinking model with 131K context and reasoning capabilities at an affordable price point. Supports function calling and structured outputs. Superseded by Grok 4.1 Fast Reasoning.
xai
Grok 4
xAI's premier flagship model combining exceptional natural language understanding, mathematical prowess, and sophisticated reasoning. This well-rounded model excels across diverse domains, from creative writing to complex calculations to logical problem-solving. With 256K context for both input and output, thinking capabilities, and vision support, it handles virtually any task with intelligence and nuance. The true jack-of-all-trades that masters most of them. Perfect when you need a single model that performs excellently across the board.
xai
Grok 4 Fast
xAI's previous-generation fast multimodal model with 2M context and cost-efficient performance. Designed for rapid agentic workflows without extended reasoning.
xai
Grok 4 Fast Reasoning
xAI's previous-generation thinking model with 2M context and cost-efficient agentic performance. Combines rapid execution with reasoning capabilities.
xai
Grok 4.1 Fast
xAI's speed demon for agentic workflows requiring rapid, accurate tool execution. With an extraordinary 2M context window, this model processes massive amounts of information while maintaining blazing-fast response times. Optimized specifically for tool calling and task completion, it excels at real-world applications like customer support, financial analysis, and automated workflows where speed is critical. The non-reasoning variant prioritizes quick responses over extended thought processes. Choose this when you need rapid, reliable agent performance at incredible value.
xai
Grok 4.1 Fast Reasoning
xAI's intelligent agent combining massive context, thinking capabilities, and tool mastery. With a 2M context window and reasoning mode, this model thoughtfully navigates complex agentic workflows while maintaining speed. The perfect balance between rapid execution and intelligent decision-making for sophisticated real-world applications. Excels at scenarios requiring both tool orchestration and reasoning, like nuanced customer support, complex financial analysis, and adaptive workflows. Choose this when your agents need to think and act intelligently.
xai
Grok Code Fast 1
xAI's lightweight agentic coding model designed for rapid, budget-friendly reasoning with interleaved tool-calling and reasoning traces. Proficient in TypeScript, Python, Java, Rust, C++, and Go. Built for the modern development loop of planning, writing, testing, and debugging. Excels at zero-to-one projects, codebase Q&A, bug fixes, and agentic coding workflows at 4x speed and 1/10th the cost of competing models.
nousresearch
Hermes 2 Pro Llama 3 8B
Hermes 2 Pro is an upgraded, retrained version of Nous Hermes 2, consisting of an updated and cleaned version of the OpenHermes 2.5 Dataset, as well as a newly introduced Function Calling and JSON Mode dataset developed in-house.
nousresearch
Hermes 3 405B Instruct
Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board. Hermes 3 405B is a frontier-level, full-parameter finetune of the Llama-3.1 405B foundation model, focused on aligning LLMs to the user, with powerful steering capabilities and control given to the end user. The Hermes 3 series builds and expands on the Hermes 2 set of capabilities, including more powerful and reliable function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills. Hermes 3 is competitive, if not superior, to Llama-3.1 Instruct models at general capabilities, with varying strengths and weaknesses attributable between the two.
nousresearch
Hermes 3 70B Instruct
Hermes 3 is a generalist language model with many improvements over Hermes 2, including advanced agentic capabilities, much better roleplaying, reasoning, multi-turn conversation, long context coherence, and improvements across the board. Hermes 3 70B is a competitive, if not superior, finetune of the Llama-3.1 70B foundation model, focused on aligning LLMs to the user, with powerful steering capabilities and control given to the end user. The Hermes 3 series builds and expands on the Hermes 2 set of capabilities, including more powerful and reliable function calling and structured output capabilities, generalist assistant capabilities, and improved code generation skills.
nousresearch
Hermes 4 405B
Hermes 4 is a large-scale reasoning model built on Meta-Llama-3.1-405B and released by Nous Research. It introduces a hybrid reasoning mode, where the model can choose to deliberate internally with <think>...</think> traces or respond directly, offering flexibility between speed and depth. Users can toggle this behaviour via the `enabled` boolean of the `reasoning` parameter. The model is instruction-tuned with an expanded post-training corpus (~60B tokens) emphasizing reasoning traces, improving performance in math, code, STEM, and logical reasoning, while retaining broad assistant utility. It also supports structured outputs, including JSON mode, schema adherence, function calling, and tool use. Hermes 4 is trained for steerability, lower refusal rates, and alignment toward neutral, user-directed behavior.
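The reasoning toggle described above can be sketched as a chat-completions-style request body. The exact payload shape (`"reasoning": {"enabled": ...}`) and the model identifier below are assumptions for illustration, not confirmed API details:

```python
import json

def build_request(prompt: str, think: bool) -> dict:
    """Build a chat-completions-style request body for Hermes 4.

    The model id and the "reasoning" object shape are hypothetical;
    consult your provider's API reference for the exact fields.
    """
    return {
        "model": "nousresearch/hermes-4-405b",  # hypothetical model id
        "messages": [{"role": "user", "content": prompt}],
        # When enabled, the model may emit <think>...</think> traces
        # before its final answer; when disabled, it responds directly.
        "reasoning": {"enabled": think},
    }

body = build_request("Prove that the sum of two odd numbers is even.", think=True)
print(json.dumps(body, indent=2))
```

Flipping `think` to `False` requests a direct answer, trading reasoning depth for latency.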
nousresearch
Hermes 4 70B
Hermes 4 70B is a hybrid reasoning model from Nous Research, built on Meta-Llama-3.1-70B. It introduces the same hybrid mode as the larger 405B release, allowing the model to either respond directly or generate explicit <think>...</think> reasoning traces before answering. Users can toggle this behaviour via the `enabled` boolean of the `reasoning` parameter. This 70B variant is trained with the expanded post-training corpus (~60B tokens) emphasizing verified reasoning data, leading to improvements in mathematics, coding, STEM, logic, and structured outputs while maintaining general assistant performance. It supports JSON mode, schema adherence, function calling, and tool use, and is designed for greater steerability with reduced refusal rates.
tencent
Hunyuan A13B Instruct
Hunyuan-A13B is a 13B active parameter Mixture-of-Experts (MoE) language model developed by Tencent, with a total parameter count of 80B and support for reasoning via Chain-of-Thought. It offers competitive benchmark performance across mathematics, science, coding, and multi-turn reasoning tasks, while maintaining high inference efficiency via Grouped Query Attention (GQA) and quantization support (FP8, GPTQ, etc.).
inflection
Inflection 3 Pi
Inflection 3 Pi powers Inflection's Pi chatbot, bringing backstory, emotional intelligence, productivity, and safety. It has access to recent news, and excels in scenarios like customer support and roleplay. Pi has been trained to mirror your tone and style: if you use more emojis, so will Pi. Try experimenting with various prompts and conversation styles.
inflection
Inflection 3 Productivity
Inflection 3 Productivity is optimized for following instructions. It is better for tasks requiring JSON output or precise adherence to provided guidelines. It has access to recent news.
prime-intellect
INTELLECT-3
INTELLECT-3 is a 106B-parameter Mixture-of-Experts model (12B active) post-trained from GLM-4.5-Air-Base using supervised fine-tuning (SFT) followed by large-scale reinforcement learning (RL). It offers state-of-the-art performance for its size across math, code, science, and general reasoning, consistently outperforming many larger frontier models. Designed for strong multi-step problem solving, it maintains high accuracy on structured tasks while remaining efficient at inference thanks to its MoE architecture.
opengvlab
InternVL3 78B
The InternVL3 series is an advanced multimodal large language model (MLLM). Compared to InternVL 2.5, InternVL3 demonstrates stronger multimodal perception and reasoning capabilities. In addition, InternVL3 is benchmarked against the Qwen2.5 Chat models, whose pre-trained base models serve as the initialization for its language component. Benefiting from Native Multimodal Pre-Training, the InternVL3 series surpasses the Qwen2.5 series in overall text performance.
ai21
Jamba Large 1.7
Jamba Large 1.7 is the latest model in the Jamba open family, offering improvements in grounding, instruction-following, and overall efficiency. Built on a hybrid SSM-Transformer architecture with a 256K context window, it delivers more accurate, contextually grounded responses and better steerability than previous versions.
kwaipilot
KAT Coder Pro V1
KAT-Coder-Pro V1 is KwaiKAT's most advanced agentic coding model in the KAT-Coder series. Designed specifically for agentic coding tasks, it excels in real-world software engineering scenarios, achieving 73.4% solve rate on the SWE-Bench Verified benchmark. The model has been optimized for tool-use capability, multi-turn interaction, instruction following, generalization, and comprehensive capabilities through a multi-stage training process, including mid-training, supervised fine-tuning (SFT), reinforcement fine-tuning (RFT), and scalable agentic RL.
moonshotai
Kimi K2
Kimi K2 is a Mixture-of-Experts (MoE) foundation model with 1 trillion total parameters and 32 billion activated parameters. Outperforms leading open-source models across general knowledge reasoning, programming, mathematics, and agent tasks. Context length 256K with automatic context caching, ToolCalls, JSON Mode, Partial Mode, and internet search support.
moonshotai
Kimi K2 Thinking
A thinking model built on the Kimi K2 foundation with general agentic and reasoning capabilities, specializing in deep reasoning tasks. With a 262K context window, it combines chain-of-thought reasoning with tool calling for complex problem-solving. Supports automatic context caching, ToolCalls, JSON Mode, Partial Mode, and internet search.
moonshotai
Kimi K2 Thinking Turbo
The ultimate autonomous thinking agent, capable of executing hundreds of sequential tool calls with coherent reasoning throughout. This model can chain 200-300 tool operations without human intervention, maintaining logical consistency across complex multi-step problems. Built specifically as a thinking agent, it reasons step-by-step while acting, achieving state-of-the-art results on the hardest benchmarks. With a massive 262K context window shared between input and output, it handles truly extensive workflows, and the Turbo variant delivers this capability at exceptional speed. Designed for complex autonomous projects requiring persistent reasoning and action.
moonshotai
Kimi K2 Turbo
High-speed version of Kimi K2, always aligned with the latest kimi-k2. Same model parameters, with output speeds of around 60 tokens/sec (up to 100 tokens/sec at peak). Context length 262K with automatic context caching, ToolCalls, JSON Mode, Partial Mode, and internet search support.
moonshotai
Kimi K2.5
Kimi's most versatile model featuring a native multimodal architecture that supports both visual and text input. Combines thinking and non-thinking modes with dialogue and agent capabilities. With a 262K context window and massive 252K output capacity, it handles complex multimodal workflows at an exceptional price point.
liquid
LFM 2 24B A2B
LFM2-24B-A2B is the largest model in the LFM2 family of hybrid architectures designed for efficient on-device deployment. Built as a 24B parameter Mixture-of-Experts model with only 2B active parameters per token, it delivers high-quality generation while maintaining low inference costs. The model fits within 32 GB of RAM, making it practical to run on consumer laptops and desktops without sacrificing capability.
liquid
LFM 2 8B A1B
LFM2-8B-A1B is an efficient on-device Mixture-of-Experts (MoE) model from Liquid AI’s LFM2 family, built for fast, high-quality inference on edge hardware. It uses 8.3B total parameters with only ~1.5B active per token, delivering strong performance while keeping compute and memory usage low—making it ideal for phones, tablets, and laptops.
liquid
LFM 2.2 6B
LFM2 is a new generation of hybrid models developed by Liquid AI, specifically designed for edge AI and on-device deployment. It sets a new standard in terms of quality, speed, and memory efficiency.
liquid
LFM 2.5 1.2B Instruct
LFM2.5-1.2B-Instruct is a compact, high-performance instruction-tuned model built for fast on-device AI. It delivers strong chat quality in a 1.2B parameter footprint, with efficient edge inference and broad runtime support.
liquid
LFM 2.5 1.2B Thinking
LFM2.5-1.2B-Thinking is a lightweight reasoning-focused model optimized for agentic tasks, data extraction, and RAG—while still running comfortably on edge devices. It supports long context (up to 32K tokens) and is designed to provide higher-quality “thinking” responses in a small 1.2B model.
meta
Llama 3 70B Instruct
Meta's Llama 3 70B instruct-tuned version was optimized for high-quality dialogue use cases. It has demonstrated strong performance compared to leading closed-source models in human evaluations. Usage of this model is subject to Meta's Acceptable Use Policy.
meta
Llama 3 8B Instruct
Meta's Llama 3 8B instruct-tuned version was optimized for high-quality dialogue use cases. It has demonstrated strong performance compared to leading closed-source models in human evaluations. Usage of this model is subject to Meta's Acceptable Use Policy.
sao10k
Llama 3 8B Lunaris
Lunaris 8B is a versatile generalist and roleplaying model based on Llama 3. It's a strategic merge of multiple models, designed to balance creativity with improved logic and general knowledge. This model aims to offer an improved experience over Stheno v3.2, with enhanced creativity and logical reasoning. For best results, use with Llama 3 Instruct context template, temperature 1.4, and min_p 0.1.
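The recommended sampling settings above can be passed through an OpenAI-compatible chat endpoint. A minimal sketch, assuming a hypothetical model ID and a server that accepts `min_p` (a common extension on open-model servers, not part of the core OpenAI schema):

```python
# Sketch: recommended Lunaris 8B sampling settings as a chat request body.
# The model ID is a hypothetical placeholder; temperature and min_p come
# from the model card's recommendation.
def lunaris_request(messages):
    return {
        "model": "sao10k/l3-lunaris-8b",  # hypothetical model ID
        "messages": messages,
        "temperature": 1.4,               # recommended by the model card
        "min_p": 0.1,                     # recommended by the model card
    }

payload = lunaris_request([{"role": "user", "content": "Hello!"}])
```

The Llama 3 Instruct context template is applied server-side by most hosts, so only the sampling parameters need to be set explicitly.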
sao10k
Llama 3 Euryale 70B v2.1
Euryale 70B v2.1 is a model focused on creative roleplay. It offers better prompt adherence, better anatomy and spatial awareness, and adapts much better to unique and custom formatting and reply formats. It is very creative, produces lots of unique swipes, and is not restrictive during roleplays.
meta
Llama 3.1 405B
Meta's Llama 3.1 405B base pre-trained model. It has demonstrated strong performance compared to leading closed-source models in human evaluations. Usage of this model is subject to Meta's Acceptable Use Policy.
meta
Llama 3.1 405B Instruct
Meta's Llama 3.1 405B instruct-tuned version is optimized for high-quality dialogue use cases with 128K context. It has demonstrated strong performance compared to leading closed-source models including GPT-4o and Claude 3.5 Sonnet in evaluations. Usage of this model is subject to Meta's Acceptable Use Policy.
sao10k
Llama 3.1 70B Hanami x1
An experiment over Euryale v2.2.
meta
Llama 3.1 70B Instruct
Meta's Llama 3.1 70B instruct-tuned version is optimized for high-quality dialogue use cases with 128K context. It has demonstrated strong performance compared to leading closed-source models in human evaluations. Usage of this model is subject to Meta's Acceptable Use Policy.
meta
Llama 3.1 8B Instruct
Meta's Llama 3.1 8B instruct-tuned version is fast and efficient. It has demonstrated strong performance compared to leading closed-source models in human evaluations. Usage of this model is subject to Meta's Acceptable Use Policy.
sao10k
Llama 3.1 Euryale 70B v2.2
Euryale L3.1 70B v2.2 is a model focused on creative roleplay. It is the successor of Euryale L3 70B v2.1.
nvidia
Llama 3.1 Nemotron 70B Instruct
NVIDIA's Llama 3.1 Nemotron 70B is a language model designed for generating precise and useful responses. Leveraging Llama 3.1 70B architecture and Reinforcement Learning from Human Feedback (RLHF), it excels in automatic alignment benchmarks. This model is tailored for applications requiring high accuracy in helpfulness and response generation, suitable for diverse user queries across multiple domains. Usage of this model is subject to Meta's Acceptable Use Policy.
nvidia
Llama 3.1 Nemotron Ultra 253B v1
Llama-3.1-Nemotron-Ultra-253B-v1 is a large language model (LLM) optimized for advanced reasoning, human-interactive chat, retrieval-augmented generation (RAG), and tool-calling tasks. Derived from Meta’s Llama-3.1-405B-Instruct, it has been significantly customized using Neural Architecture Search (NAS), resulting in enhanced efficiency, reduced memory usage, and improved inference latency. The model supports a context length of up to 128K tokens and can operate efficiently on an 8x NVIDIA H100 node. Note: you must include `detailed thinking on` in the system prompt to enable reasoning.
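The reasoning toggle above is applied via the system prompt. A minimal sketch of building the message list, assuming a hypothetical model ID; the `detailed thinking on` string comes from the note above, and the corresponding `off` string is an assumption based on NVIDIA's documented convention for this model family:

```python
# Sketch: enabling Nemotron Ultra's reasoning via the system prompt.
# "detailed thinking on" is documented above; "detailed thinking off"
# is assumed from NVIDIA's convention for toggling reasoning off.
def nemotron_messages(user_prompt, reasoning=True):
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

msgs = nemotron_messages("Prove that the square root of 2 is irrational.")
```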
meta
Llama 3.2 11B Vision Instruct
Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answering, bridging the gap between language generation and visual reasoning. Pre-trained on a massive dataset of image-text pairs, it performs well in complex, high-accuracy image analysis. Its ability to integrate visual understanding with language processing makes it an ideal solution for industries requiring comprehensive visual-linguistic AI applications, such as content creation, AI-driven customer service, and research. Usage of this model is subject to Meta's Acceptable Use Policy.
meta
Llama 3.2 1B Instruct
Llama 3.2 1B is a 1-billion-parameter language model focused on efficiently performing natural language tasks, such as summarization, dialogue, and multilingual text analysis. Its smaller size allows it to operate efficiently in low-resource environments while maintaining strong task performance. Supporting eight core languages and fine-tunable for more, it is ideal for businesses or developers seeking lightweight yet powerful AI solutions that can operate in diverse multilingual settings without the high computational demand of larger models. Usage of this model is subject to Meta's Acceptable Use Policy.
meta
Llama 3.2 3B Instruct
Llama 3.2 3B is a 3-billion-parameter multilingual large language model, optimized for advanced natural language processing tasks like dialogue generation, reasoning, and summarization. Designed with the latest transformer architecture, it supports eight languages, including English, Spanish, and Hindi, and is adaptable for additional languages. Trained on 9 trillion tokens, the Llama 3.2 3B model excels in instruction-following, complex reasoning, and tool use. Its balanced performance makes it ideal for applications needing accuracy and efficiency in text generation across multilingual settings. Usage of this model is subject to Meta's Acceptable Use Policy.
meta
Llama 3.3 70B
A balanced model combining performance with efficiency for conversational AI. Designed for content creation, enterprise applications, and research with strong language understanding. Handles summarization, classification, sentiment analysis, and code generation.
sao10k
Llama 3.3 Euryale 70B
Euryale L3.3 70B is a model focused on creative roleplay. It is the successor of Euryale L3 70B v2.2.
nvidia
Llama 3.3 Nemotron Super 49B V1.5
Llama-3.3-Nemotron-Super-49B-v1.5 is a 49B-parameter, English-centric reasoning/chat model derived from Meta’s Llama-3.3-70B-Instruct with a 128K context. It’s post-trained for agentic workflows (RAG, tool calling) via SFT across math, code, science, and multi-turn chat, followed by multiple RL stages: Reward-aware Preference Optimization (RPO) for alignment, RL with Verifiable Rewards (RLVR) for step-wise reasoning, and iterative DPO to refine tool-use behavior. A distillation-driven Neural Architecture Search (“Puzzle”) replaces some attention blocks and varies FFN widths to shrink memory footprint and improve throughput, enabling single-GPU (H100/H200) deployment while preserving instruction following and CoT quality. In internal evaluations (NeMo-Skills, up to 16 runs, temp = 0.6, top_p = 0.95), the model reports strong reasoning/coding results, e.g., MATH500 pass@1 = 97.4, AIME-2024 = 87.5, AIME-2025 = 82.71, GPQA = 71.97, LiveCodeBench (24.10–25.02) = 73.58, and MMLU-Pro (CoT) = 79.53. The model targets practical inference efficiency (high tokens/s, reduced VRAM) with Transformers/vLLM support and explicit “reasoning on/off” modes (chat-first defaults, greedy recommended when disabled). Suitable for building agents, assistants, and long-context retrieval systems where balanced accuracy-to-cost and reliable tool use matter.
meta
Llama 4 Maverick 17B
A multimodal model from the Llama 4 collection with MoE architecture for text and image tasks. Designed for multimodal experiences with vision capabilities.
meta
Llama 4 Scout 17B
A compact multimodal model using mixture-of-experts architecture for text and image understanding. Designed for efficient multimodal experiences with vision support.
meta
Llama Guard 3 8B
Llama Guard 3 is a Llama-3.1-8B pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It acts as an LLM that generates text indicating whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated. Llama Guard 3 was aligned to safeguard against the MLCommons standardized hazards taxonomy and designed to support Llama 3.1 capabilities. Specifically, it provides content moderation in 8 languages, and was optimized to support safety and security for search and code interpreter tool calls.
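Since Llama Guard emits free text rather than structured output, downstream code typically parses its verdict. A minimal parser sketch, assuming the commonly documented output shape (first line `safe` or `unsafe`, then a line of comma-separated hazard codes when unsafe):

```python
# Sketch: parsing a Llama Guard 3 response. Assumes the documented
# output shape: "safe", or "unsafe" followed on the next line by
# comma-separated MLCommons hazard codes (e.g. "S1,S10").
def parse_guard_output(text):
    lines = text.strip().splitlines()
    verdict = lines[0].strip().lower()
    if verdict == "safe":
        return {"safe": True, "categories": []}
    categories = lines[1].split(",") if len(lines) > 1 else []
    return {"safe": False, "categories": [c.strip() for c in categories]}

result = parse_guard_output("unsafe\nS1,S10")
```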
meta
Llama Guard 4 12B
Llama Guard 4 is a Llama 4 Scout-derived multimodal pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It acts as an LLM generating text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated. Llama Guard 4 was aligned to safeguard against the standardized MLCommons hazards taxonomy and designed to support multimodal Llama 4 capabilities. Specifically, it combines features from previous Llama Guard models, providing content moderation for English and multiple supported languages, along with enhanced capabilities to handle mixed text-and-image prompts, including multiple images. Additionally, Llama Guard 4 is integrated into the Llama Moderations API, extending robust safety classification to text and images.
meta
LlamaGuard 2 8B
This safeguard model has 8B parameters and is based on the Llama 3 family. It can do both prompt and response classification. LlamaGuard 2 acts as a normal LLM would, generating text that indicates whether the given input/output is safe/unsafe. If deemed unsafe, it will also share the content categories violated. For best results, please use raw prompt input or the completions endpoint, instead of the chat API. Usage of this model is subject to Meta's Acceptable Use Policy.
eleutherai
Llemma 7b
Llemma 7B is a language model for mathematics. It was initialized with Code Llama 7B weights, and trained on the Proof-Pile-2 for 200B tokens. Llemma models are particularly strong at chain-of-thought mathematical reasoning and using computational tools for mathematics, such as Python and formal theorem provers.
meituan
LongCat Flash
LongCat-Flash-Chat is a large-scale Mixture-of-Experts (MoE) model with 560B total parameters, of which 18.6B–31.3B (≈27B on average) are dynamically activated per input. It introduces a shortcut-connected MoE design to reduce communication overhead and achieve high throughput while maintaining training stability through advanced scaling strategies such as hyperparameter transfer, deterministic computation, and multi-stage optimization. This release, LongCat-Flash-Chat, is a non-thinking foundation model optimized for conversational and agentic tasks. It supports long context windows up to 128K tokens and shows competitive performance across reasoning, coding, instruction following, and domain benchmarks, with particular strengths in tool use and complex multi-step interactions.
neversleep
Lumimaid v0.2 8B
Lumimaid v0.2 8B is a finetune of Llama 3.1 8B with a "HUGE step up dataset wise" compared to Lumimaid v0.1. Sloppy chat outputs were purged. Usage of this model is subject to Meta's Acceptable Use Policy.
arcee
Maestro Reasoning
Maestro Reasoning is Arcee's flagship analysis model: a 32B-parameter derivative of Qwen 2.5-32B tuned with DPO and chain-of-thought RL for step-by-step logic. Compared to the earlier 7B preview, the production 32B release widens the context window to 128K tokens and doubles the pass rate on MATH and GSM-8K, while also lifting code completion accuracy. Its instruction style encourages structured "thought → answer" traces that can be parsed or hidden according to user preference. That transparency pairs well with audit-focused industries like finance or healthcare, where seeing the reasoning path matters. In Arcee Conductor, Maestro is automatically selected for complex, multi-constraint queries that smaller SLMs bounce.
mistral
Magistral Medium
Our frontier-class reasoning model released September 2025.
mistral
Magistral Small
Our efficient reasoning model released September 2025.
anthracite
Magnum v4 72B
This is a series of models designed to replicate the prose quality of the Claude 3 models, specifically Sonnet and Opus. The model is fine-tuned on top of Qwen2.5 72B.
inception
Mercury
Mercury is a diffusion-based large language model from Inception Labs, designed for ultra-fast inference with sub-second latency. It supports a 128K context window, native tool calling, and structured outputs. Mercury excels at general-purpose reasoning, chat, and agent workflows where speed is paramount.
inception
Mercury 2
Mercury 2 is an extremely fast reasoning LLM and the first reasoning diffusion LLM (dLLM). Instead of generating tokens sequentially, Mercury 2 produces and refines multiple tokens in parallel, achieving over 1,000 tokens/sec on standard GPUs. It supports tunable reasoning levels, 128K context, native tool use, and schema-aligned JSON output. Built for coding workflows where latency compounds, real-time voice and search, and agent loops.
inception
Mercury Coder
Mercury Coder is a diffusion-based code-specialized language model from Inception Labs, optimized for code generation, editing, and completion with ultra-fast inference. It supports a 128K chat context window, native tool calling, and structured outputs. Ideal for coding agents and development workflows where speed and accuracy are critical.
xiaomi
MiMo V2 Flash
MiMo-V2-Flash is an open-source foundation language model developed by Xiaomi. It is a Mixture-of-Experts model with 309B total parameters and 15B active parameters, adopting a hybrid attention architecture. MiMo-V2-Flash supports a hybrid-thinking toggle and a 256K context window, and excels at reasoning, coding, and agent scenarios. On SWE-bench Verified and SWE-bench Multilingual, MiMo-V2-Flash ranks as the #1 open-source model globally, delivering performance comparable to Claude Sonnet 4.5 while costing only about 3.5% as much. Users can control reasoning behaviour via the `enabled` boolean of the `reasoning` parameter.
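The hybrid-thinking toggle above maps to a request parameter. A minimal sketch of the request body, assuming a hypothetical model ID and the `reasoning: {enabled: ...}` shape described above:

```python
# Sketch: toggling MiMo-V2-Flash's hybrid thinking mode via the
# `reasoning.enabled` boolean. The model ID is a hypothetical placeholder.
def mimo_request(messages, thinking=True):
    return {
        "model": "xiaomi/mimo-v2-flash",  # hypothetical model ID
        "messages": messages,
        "reasoning": {"enabled": thinking},
    }

body = mimo_request([{"role": "user", "content": "Summarize this diff."}],
                    thinking=False)
```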
minimax
MiniMax M1
MiniMax-M1 is a large-scale, open-weight reasoning model designed for extended context and high-efficiency inference. It leverages a hybrid Mixture-of-Experts (MoE) architecture paired with a custom "lightning attention" mechanism, allowing it to process long sequences—up to 1 million tokens—while maintaining competitive FLOP efficiency. With 456 billion total parameters and 45.9B active per token, this variant is optimized for complex, multi-step reasoning tasks. Trained via a custom reinforcement learning pipeline (CISPO), M1 excels in long-context understanding, software engineering, agentic tool use, and mathematical reasoning. Benchmarks show strong performance across FullStackBench, SWE-bench, MATH, GPQA, and TAU-Bench, often outperforming other open models like DeepSeek R1 and Qwen3-235B.
minimax
MiniMax M2
A Mixture-of-Experts model with 230B total parameters and only 10B activated per inference, delivering exceptional efficiency. Built for the agentic era with function calling, advanced reasoning, and real-time streaming capabilities. With a 200K shared context window and 128K max output (including chain-of-thought), it handles massive contexts for coding and agentic work. Superseded by MiniMax M2.1 with improved coding and refactoring capabilities.
minimax
MiniMax M2-her
MiniMax M2-her is a dialogue-first model purpose-built for role-playing and immersive multi-turn conversations. Developed from three years of role-play optimization, it excels at intuitive preference alignment (reading between the lines to adapt to user style), dynamic story progression (driving narrative forward with vivid prose), and high-fidelity world experience (maintaining strict coherence with established lore and character voice). Supports rich role settings including system roles, user personas, conversation groups, and example dialogue learning. With a 200K context window and 2K max output, it prioritizes deep emotional connection and long-horizon character consistency over raw output length.
minimax
MiniMax M2.1
A 230B MoE model (10B active) optimized for code generation and refactoring with polyglot programming mastery. Features enhanced reasoning capabilities and precision code refactoring across multiple languages. With a 200K shared context window, 131K max output, and ~60 tps output speed, it handles substantial coding tasks with confidence. Superseded by MiniMax M2.5 with SOTA coding performance.
minimax
MiniMax M2.1 Highspeed
The highspeed variant of MiniMax M2.1, delivering the same polyglot code mastery and precision refactoring at significantly faster inference speeds (~100 tokens per second vs ~60 tps standard). Ideal for latency-sensitive applications and real-time coding assistance. Shares the same 200K shared context window and 131K max output. Superseded by MiniMax M2.5 Highspeed.
minimax
MiniMax M2.5
MiniMax M2.5 is a SOTA large language model designed for real-world productivity, achieving 80.2% on SWE-Bench Verified and 51.3% on Multi-SWE-Bench. A 230B MoE model (10B active) capable of handling the entire development process of complex systems across Web, Android, iOS, Windows, and Mac platforms. Excels at coding, agentic tool use, search, and office productivity. With a 200K shared context window, 131K max output, and ~60 tps output speed, it delivers peak performance at exceptional value.
minimax
MiniMax M2.5 Highspeed
The highspeed variant of MiniMax M2.5, delivering the same SOTA coding performance at significantly faster inference speeds (~100 tokens per second vs ~60 tps standard). Same quality as M2.5 for full-stack development across all platforms with dramatically lower latency. Shares the same 200K shared context window and 131K max output. Ideal for real-time coding assistance and latency-sensitive production deployments.
minimax
MiniMax-01
MiniMax-01 combines MiniMax-Text-01 for text generation and MiniMax-VL-01 for image understanding. It has 456 billion parameters, with 45.9 billion parameters activated per inference, and can handle a context of up to 4 million tokens. The text model adopts a hybrid architecture that combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE). The image model adopts the “ViT-MLP-LLM” framework and is trained on top of the text model.
mistral
Ministral 14B
Ministral 3 (a.k.a. Tinystral) 14B Instruct.
mistral
Ministral 3B
Ministral 3 (a.k.a. Tinystral) 3B Instruct.
mistral
Ministral 8B
Ministral 3 (a.k.a. Tinystral) 8B Instruct.
mistral
Mistral 7B
A 7B transformer model, fast-deployed and easily customisable.
mistral
Mistral Large
Open-weight, general-purpose, flagship multimodal and multilingual model.
mistral
Mistral Medium
Update on Mistral Medium 3 with improved capabilities.
mistral
Mistral Nemo
Our best multilingual open source model released July 2024.
mistral
Mistral Saba
A 24B-parameter language model designed for the Middle East and South Asia, supporting Arabic, Tamil, Malayalam, and other regional languages.
mistral
Mistral Small
Our latest enterprise-grade small model with the latest version released June 2025.
mistral
Mistral Small Creative
Official Mistral Small Creative model from Mistral AI.
mistral
Mistral Tiny
Our best multilingual open source model released July 2024.
mistral
Mixtral 8x22B
Mixtral 8x22B is currently the most performant open model: a sparse Mixture-of-Experts (SMoE) built from eight 22B-parameter experts. It uses only 39B active parameters out of 141B.
mistral
Mixtral 8x7B
A sparse Mixture-of-Experts (SMoE) built from eight 7B-parameter experts. Uses 12.9B active parameters out of 45B total.
allenai
Molmo2 8B
Molmo2-8B is an open vision-language model developed by the Allen Institute for AI (Ai2) as part of the Molmo2 family, supporting image, video, and multi-image understanding and grounding. It is based on Qwen3-8B and uses SigLIP 2 as its vision backbone, outperforming other open-weight, open-data models on short videos, counting, and captioning, while remaining competitive on long-video tasks.
moonshotai
Moonshot V1 128K
Moonshot V1 128K is a legacy text generation model with a 131,072 token context window. Superseded by Kimi K2 with superior coding, reasoning, and agent capabilities.
moonshotai
Moonshot V1 128K Vision
Moonshot V1 128K Vision is a legacy multimodal model with a 131,072 token context window supporting image understanding. Superseded by Kimi K2.5 with native multimodal architecture.
moonshotai
Moonshot V1 32K
Moonshot V1 32K is a legacy text generation model with a 32,768 token context window. Superseded by Kimi K2 with superior coding, reasoning, and agent capabilities.
moonshotai
Moonshot V1 32K Vision
Moonshot V1 32K Vision is a legacy multimodal model with a 32,768 token context window supporting image understanding. Superseded by Kimi K2.5 with native multimodal architecture.
moonshotai
Moonshot V1 8K
Moonshot V1 8K is a legacy text generation model with an 8,192 token context window. Superseded by Kimi K2 with superior coding, reasoning, and agent capabilities.
moonshotai
Moonshot V1 8K Vision
Moonshot V1 8K Vision is a legacy multimodal model with an 8,192 token context window supporting image understanding. Superseded by Kimi K2.5 with native multimodal architecture.
morph
Morph V3 Fast
Morph's fastest apply model for code edits. ~10,500 tokens/sec with 96% accuracy for rapid code transformations. The model requires the prompt to be in the following format: <instruction>{instruction}</instruction> <code>{initial_code}</code> <update>{edit_snippet}</update>. Zero Data Retention is enabled for Morph.
morph
Morph V3 Large
Morph's high-accuracy apply model for complex code edits. ~4,500 tokens/sec with 98% accuracy for precise code transformations. The model requires the prompt to be in the following format: <instruction>{instruction}</instruction> <code>{initial_code}</code> <update>{edit_snippet}</update>. Zero Data Retention is enabled for Morph.
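Both Morph apply models expect the tagged prompt format quoted above. A minimal builder sketch (the example instruction and snippets are illustrative):

```python
# Sketch: assembling the prompt format the Morph V3 apply models require,
# as documented: instruction, original code, and edit snippet in tags.
def morph_prompt(instruction, initial_code, edit_snippet):
    return (
        f"<instruction>{instruction}</instruction> "
        f"<code>{initial_code}</code> "
        f"<update>{edit_snippet}</update>"
    )

prompt = morph_prompt(
    "Rename the variable x to total",
    "x = 1 + 2\nprint(x)",
    "total = 1 + 2\nprint(total)",
)
```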
gryphe
MythoMax 13B
One of the highest performing and most popular fine-tunes of Llama 2 13B, with rich descriptions and roleplay. #merge
nvidia
Nemotron 3 Nano 30B A3B
NVIDIA Nemotron 3 Nano 30B A3B is a small MoE language model delivering high compute efficiency and accuracy for developers building specialized agentic AI systems. The model is fully open, with open weights, datasets, and recipes, so developers can easily customize, optimize, and deploy it on their own infrastructure for maximum privacy and security. Note: for the free endpoint, all prompts and outputs are logged to improve the provider's model and its products and services. Please do not upload any personal, confidential, or otherwise sensitive information. This is for trial use only. Do not use for production or business-critical systems.
nvidia
Nemotron Nano 12B 2 VL
NVIDIA Nemotron Nano 2 VL is a 12-billion-parameter open multimodal reasoning model designed for video understanding and document intelligence. It introduces a hybrid Transformer-Mamba architecture, combining transformer-level accuracy with Mamba’s memory-efficient sequence modeling for significantly higher throughput and lower latency. The model supports inputs of text and multi-image documents, producing natural-language outputs. It is trained on high-quality NVIDIA-curated synthetic datasets optimized for optical-character recognition, chart reasoning, and multimodal comprehension. Nemotron Nano 2 VL achieves leading results on OCRBench v2 and scores ≈ 74 average across MMMU, MathVista, AI2D, OCRBench, OCR-Reasoning, ChartQA, DocVQA, and Video-MME—surpassing prior open VL baselines. With Efficient Video Sampling (EVS), it handles long-form videos while reducing inference cost. Open weights, training data, and fine-tuning recipes are released under a permissive NVIDIA open license, with deployment supported across NeMo, NIM, and major inference runtimes.
nvidia
Nemotron Nano 9B V2
NVIDIA-Nemotron-Nano-9B-v2 is a large language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be controlled via a system prompt. If the user prefers the model to provide its final answer without intermediate reasoning traces, it can be configured to do so.
neversleep
Noromaid 20B
A collab between IkariDev and Undi. This merge is suitable for RP, ERP, and general knowledge. #merge #uncensored
amazon
Nova 2 Lite
Amazon's fast, cost-effective reasoning model built for everyday tasks. Nova 2 Lite delivers strong multimodal understanding across text, images, video, and documents with a massive 1M context window and extended thinking capabilities. With built-in tools for code interpretation and web grounding, it handles complex analytical tasks while keeping costs low. Perfect for production applications that need reliable intelligence and broad context without the premium price tag.
amazon
Nova Lite
Amazon's fast, cost-effective multimodal model processing text and images with a generous 300K context window. Nova Lite handles up to multiple images per request and delivers rapid responses for customer interactions, document analysis, and visual understanding tasks. An excellent choice for applications needing broad multimodal capabilities without premium costs.
amazon
Nova Micro
Amazon's fastest and most cost-effective text-only model delivering the lowest latency in the Nova family. Nova Micro excels at text summarization, translation, classification, and simple coding tasks. With a 128K context window and tool support, it handles a broad range of text-based workflows. Perfect for high-volume applications and cost-conscious projects where speed matters most.
amazon
Nova Premier
Amazon's most capable model designed for complex multimodal tasks and agentic workflows. Nova Premier processes text, images, video, and documents with a massive 1M token context window, enabling analysis of extensive content in a single request. It excels at complex reasoning, detailed analysis, and multi-step problem solving where accuracy is paramount. Choose this when you need maximum intelligence from the Nova family.
amazon
Nova Pro
Amazon's balanced multimodal model offering the best combination of accuracy, speed, and cost for general tasks. Nova Pro processes text, images, video, and documents with a 300K context window and delivers strong performance on visual question answering and video understanding benchmarks. The ideal choice when you need reliable multimodal capabilities with a good balance of quality and cost.
openai
o1
The o1 model series is designed to spend more time thinking before responding. These models are trained with large-scale reinforcement learning to reason using chain of thought, and are optimized for math, science, programming, and other STEM-related tasks.
openai
o1 Pro
The o-series of models are trained with reinforcement learning to think before they answer and perform complex reasoning. The o1 Pro model uses more compute to think harder and provide consistently better answers.
openai
o3
o3 is a well-rounded and powerful model across domains. It sets a new standard for math, science, coding, and visual reasoning tasks. It also excels at technical writing and instruction-following. Use it to think through multi-step problems that involve analysis across text, code, and images.
openai
o3 Deep Research
o3 Deep Research is OpenAI's advanced model for deep research, designed to tackle complex, multi-step research tasks. This model always uses web search, which adds additional cost.
openai
o3 Mini
o3 Mini is a cost-efficient language model optimized for STEM reasoning tasks, particularly excelling in science, mathematics, and coding. The model features three adjustable reasoning effort levels and supports key developer capabilities including function calling, structured outputs, and streaming.
openai
o3 Mini High
o3 Mini High is the same model as o3 Mini with reasoning effort set to high. o3 Mini is a cost-efficient language model optimized for STEM reasoning tasks, particularly excelling in science, mathematics, and coding.
openai
o3 Pro
The o-series of models are trained with reinforcement learning to think before they answer and perform complex reasoning. The o3 Pro model uses more compute to think harder and provide consistently better answers.
openai
o4 Mini
o4 Mini is a compact reasoning model in the o-series, optimized for fast, cost-efficient performance while retaining strong multimodal and agentic capabilities. It supports tool use and demonstrates competitive reasoning and coding performance, outperforming its predecessor o3 Mini and even approaching o3 in some domains.
openai
o4 Mini Deep Research
o4 Mini Deep Research is OpenAI's faster, more affordable deep research model, ideal for tackling complex, multi-step research tasks. This model always uses web search, which incurs additional cost.
openai
o4 Mini High
o4 Mini High is the same model as o4 Mini with reasoning effort set to high. o4 Mini is a compact reasoning model in the o-series, optimized for fast, cost-efficient performance while retaining strong multimodal and agentic capabilities.
allenai
Olmo 2 32B Instruct
OLMo-2 32B Instruct is a supervised instruction-finetuned variant of the OLMo-2 32B March 2025 base model. It excels in complex reasoning and instruction-following tasks across diverse benchmarks such as GSM8K, MATH, IFEval, and general NLP evaluation. Developed by Ai2, OLMo-2 32B is part of an open, research-oriented initiative, trained primarily on English-language datasets to advance the understanding and development of open-source language models.
allenai
Olmo 3 32B Think
Olmo 3 32B Think is a large-scale, 32-billion-parameter model purpose-built for deep reasoning, complex logic chains and advanced instruction-following scenarios. Its capacity enables strong performance on demanding evaluation tasks and highly nuanced conversational reasoning. Developed by Ai2 under the Apache 2.0 license, Olmo 3 32B Think embodies the Olmo initiative’s commitment to openness, offering full transparency across weights, code and training methodology.
allenai
Olmo 3 7B Instruct
Olmo 3 7B Instruct is a supervised instruction-fine-tuned variant of the Olmo 3 7B base model, optimized for instruction-following, question-answering, and natural conversational dialogue. By leveraging high-quality instruction data and an open training pipeline, it delivers strong performance across everyday NLP tasks while remaining accessible and easy to integrate. Developed by Ai2 under the Apache 2.0 license, the model offers a transparent, community-friendly option for instruction-driven applications.
allenai
Olmo 3 7B Think
Olmo 3 7B Think is a research-oriented language model in the Olmo family designed for advanced reasoning and instruction-driven tasks. It excels at multi-step problem solving, logical inference, and maintaining coherent conversational context. Developed by Ai2 under the Apache 2.0 license, Olmo 3 7B Think supports transparent, fully open experimentation and provides a lightweight yet capable foundation for academic research and practical NLP workflows.
allenai
Olmo 3.1 32B Instruct
Olmo 3.1 32B Instruct is a large-scale, 32-billion-parameter instruction-tuned language model engineered for high-performance conversational AI, multi-turn dialogue, and practical instruction following. As part of the Olmo 3.1 family, this variant emphasizes responsiveness to complex user directions and robust chat interactions while retaining strong capabilities on reasoning and coding benchmarks. Developed by Ai2 under the Apache 2.0 license, Olmo 3.1 32B Instruct reflects the Olmo initiative’s commitment to openness and transparency.
allenai
Olmo 3.1 32B Think
Olmo 3.1 32B Think is a large-scale, 32-billion-parameter model designed for deep reasoning, complex multi-step logic, and advanced instruction following. Building on the Olmo 3 series, version 3.1 delivers refined reasoning behavior and stronger performance across demanding evaluations and nuanced conversational tasks. Developed by Ai2 under the Apache 2.0 license, Olmo 3.1 32B Think continues the Olmo initiative’s commitment to openness, providing full transparency across model weights, code, and training methodology.
writer
Palmyra X5
Palmyra X5 is Writer's most advanced model, purpose-built for building and scaling AI agents across the enterprise. It delivers industry-leading speed and efficiency on context windows up to 1 million tokens, powered by a novel transformer architecture and hybrid attention mechanisms. This enables faster inference and expanded memory for processing large volumes of enterprise data, critical for scaling AI agents.
microsoft
Phi 4
Phi-4 is designed to perform well in complex reasoning tasks and can operate efficiently in situations with limited memory or where quick responses are needed. At 14 billion parameters, it was trained on a mix of high-quality synthetic datasets, data from curated websites, and academic materials. It has undergone careful improvement to follow instructions accurately and maintain strong safety standards. It works best with English language inputs.
mistral
Pixtral 12B
A compact 12B multimodal model with image understanding alongside text capabilities.
mistral
Pixtral Large
The official pixtral-large-2411 model from Mistral AI.
alibaba
QVQ Max
The Tongyi Qianwen QVQ visual reasoning model supports visual input and chain-of-thought output, demonstrating stronger capabilities in mathematics, programming, visual analysis, creation, and general tasks. This model is a historical snapshot of QVQ-Max from March 25, 2025, and is expected to be maintained for one month after the release date of the next snapshot version (to be determined).
alibaba
Qwen 3 Max Thinking
The most capable Qwen reasoning model, integrating thinking and non-thinking modes for comprehensive problem-solving. In thinking mode, it combines deliberate reasoning with web search, web extraction, and code interpreter tools to tackle complex challenges with greater accuracy. With a 256K context window and 65K output tokens, this model excels at problems requiring both deep thought and external tool use.
alibaba
Qwen Flash
The Qwen3 Flash model (snapshot 2025-07-28) offers a powerful fusion of thinking and non-thinking modes with dynamic in-conversation switching, excelling in complex reasoning while showing significant gains in instruction following and text comprehension. It supports a 1M context length and is billed on a tiered basis according to context usage.
alibaba
Qwen Flash Character
The Qwen Role-Playing Model Series is specifically optimized for multi-language anthropomorphic interaction scenarios. It demonstrates advanced capabilities in character consistency maintenance, context-aware dialogue progression, and empathetic engagement, enabling precise personalized character embodiment. This version significantly enhances Japanese linguistic localization (including dialects and honorifics), human-like role-playing authenticity, narrative coherence control, and scenario-based cognitive intelligence.
alibaba
Qwen Max
The most capable model in the Qwen series, with improved code capabilities in both Chinese and English, stronger logical abilities, and broader multilingual capabilities. The model's response detail and format clarity have been improved, as have its creative abilities, JSON format compliance, and role-playing capabilities. This model is a snapshot version of Qwen-Max from January 25, 2025, and is expected to be maintained for one month after the next snapshot goes live.
alibaba
Qwen MT Flash
Qwen-MT-Flash, a large language model from the Qwen series, has been fully upgraded with the Qwen 3 architecture for significantly enhanced performance and translation quality. It provides rapid, cost-effective translation across 92 languages, while supporting advanced features such as terminology intervention, format preservation, and domain-specific adaptation. It is the ideal choice for applications requiring a powerful balance of speed, quality, and cost.
alibaba
Qwen MT Lite
Qwen-MT-Lite is a large language model of the Qwen model series that specializes in multi-lingual translation. It provides high-quality and rapid translation services across 32 languages at a cost-effective price. It offers features such as terminology intervention, format preservation, and domain-specific translation to cater to the diverse needs of various applications, ensuring both efficiency and performance.
alibaba
Qwen MT Plus
Qwen-MT-Plus, the flagship translation model from our Qwen series, is now fully upgraded with the Qwen3 architecture. It supports 92 languages and delivers exceptionally accurate and natural-sounding translations. Its advanced capabilities in contextual understanding, terminology control, and format preservation make it a superior choice over traditional models, especially for specialized domains.
alibaba
Qwen MT Turbo
Qwen-MT-Turbo, a large language model from the Qwen series, has been fully upgraded with the Qwen 3 architecture for significantly enhanced performance and translation quality. It provides rapid, cost-effective translation across 92 languages, while supporting advanced features such as terminology intervention, format preservation, and domain-specific adaptation. It is the ideal choice for applications requiring a powerful balance of speed, quality, and cost.
alibaba
Qwen Plus
A well-balanced model in the Qwen series, with reasoning performance and speed between Qwen-Max and Qwen-Turbo, making it suitable for moderately complex tasks. Compared to previous versions, it shows significant improvements in both Chinese and English code generation, logical reasoning, and multilingual abilities. Its response style has been adjusted to align with human preferences, with noticeable enhancements in the level of detail and clarity of responses, and specialized improvements in creative writing, adherence to JSON formatting, and role-playing abilities.
alibaba
Qwen Plus Character
The role-playing model of the Qwen series. This is a dynamically updated version; notifications will be provided in advance of any model updates. It is suited to anthropomorphic role-playing, with optimized capabilities for following predefined character instructions, advancing conversations, and demonstrating active listening and empathy. It also supports faithful reproduction of personalized characters.
alibaba
Qwen Plus Character Japanese
The Qwen Role-Playing Model Series is specifically optimized for Japanese anthropomorphic interaction scenarios. It demonstrates advanced capabilities in character consistency maintenance, context-aware dialogue progression, and empathetic engagement, enabling precise personalized character embodiment. This version significantly enhances Japanese linguistic localization (including dialects and honorifics), human-like role-playing authenticity, narrative coherence control, and scenario-based cognitive intelligence.
alibaba
Qwen Plus Thinking
Qwen Plus 0728, based on the Qwen3 foundation model, is a hybrid reasoning model with a 1M-token context window and a balanced combination of performance, speed, and cost.
alibaba
Qwen QwQ Plus
An enhanced edition of the Qwen QwQ reasoning model, trained on the Qwen2.5 base model, with reasoning ability greatly improved through reinforcement learning. Its core mathematics and code benchmarks (AIME 24/25, LiveCodeBench) and several general benchmarks (IFEval, LiveBench, etc.) reach the level of the full DeepSeek-R1.
alibaba
Qwen Turbo
Qwen3 Turbo is a new hybrid reasoning model enabling dynamic switching between reasoning and non-reasoning modes mid-dialogue. With fewer parameters, it rivals QwQ-32B in reasoning performance while surpassing Qwen2.5-Turbo in general capabilities, achieving state-of-the-art (SOTA) performance at its scale. This model is a snapshot version as of April 28, 2025.
alibaba
Qwen VL Max
The model has improved math and reasoning capabilities, with the response style adjusted to better align with human preferences. The clarity and detail of responses have been significantly enhanced. This is the snapshot version as of April 8, 2025.
alibaba
Qwen VL Plus
This model is a snapshot version of Qwen-VL-Plus as of August 15, 2025. It approaches the general capabilities of Qwen2.5-VL-32B, with improved performance in object and person recognition, enhanced accuracy in real-world scenarios, and reduced hallucinations.
alibaba
Qwen2.5 14B Instruct
Qwen2.5-14B-Instruct is an open-source instruction-tuned model with 14 billion parameters. It supports a context length of up to 131,072 tokens; to ensure smooth operation and output, the API limits maximum input to 129,024 tokens and maximum output to 8,192 tokens.
alibaba
Qwen2.5 14B Instruct 1M
The 14B model of the Qwen2.5 series has gained significantly more knowledge compared to Qwen2, and has greatly improved in programming and mathematical abilities. Additionally, the new model has made improvements in executing instructions, generating long texts, understanding structured data (such as tables), and generating structured outputs, particularly JSON. It supports a context of 1M tokens.
alibaba
Qwen2.5 32B Instruct
Qwen2.5-32B-Instruct is an open-source instruction-tuned model with 32 billion parameters. It supports a context length of up to 131,072 tokens; to ensure smooth operation and output, the API limits maximum input to 129,024 tokens and maximum output to 8,192 tokens.
alibaba
Qwen2.5 72B Instruct
Qwen2.5-72B-Instruct is an open-source instruction-tuned model with 72 billion parameters. It supports a context length of up to 131,072 tokens; to ensure smooth operation and output, the API limits maximum input to 129,024 tokens and maximum output to 8,192 tokens.
alibaba
Qwen2.5 7B Instruct
Qwen2.5-7B-Instruct is an open-source instruction-tuned model with 7 billion parameters. It supports a context length of up to 131,072 tokens; to ensure smooth operation and output, the API limits maximum input to 129,024 tokens and maximum output to 8,192 tokens.
alibaba
Qwen2.5 7B Instruct 1M
The 7B model of the Qwen2.5 series has gained significantly more knowledge compared to Qwen2, and has greatly improved in programming and mathematical abilities. Additionally, the new model has made improvements in executing instructions, generating long texts, understanding structured data (such as tables), and generating structured outputs, particularly JSON. It supports a context of 1M tokens.
alibaba
Qwen2.5 Coder 32B Instruct
Qwen2.5-Coder is the latest series of Code-Specific Qwen large language models (formerly known as CodeQwen). Qwen2.5-Coder brings significant improvements in code generation, code reasoning, and code fixing, as well as a more comprehensive foundation for real-world applications such as Code Agents, enhancing coding capabilities while maintaining its strengths in mathematics and general competencies.
alibaba
Qwen2.5 Coder 7B Instruct
Qwen2.5-Coder-7B-Instruct is a 7B parameter instruction-tuned language model optimized for code-related tasks such as code generation, reasoning, and bug fixing. Based on the Qwen2.5 architecture, it incorporates enhancements like RoPE, SwiGLU, RMSNorm, and GQA attention with support for up to 128K tokens using YaRN-based extrapolation. It is trained on a large corpus of source code, synthetic data, and text-code grounding, providing robust performance across programming languages and agentic coding workflows. This model is part of the Qwen2.5-Coder family and offers strong compatibility with tools like vLLM for efficient deployment. Released under the Apache 2.0 license.
alibaba
Qwen2.5 VL 32B Instruct
This Qwen2.5-VL series model approaches Qwen2.5-VL-72B in answering math and subject questions, with its response style significantly adjusted toward human preferences. For objective queries such as mathematics, logical reasoning, and knowledge Q&A, the detail and formatting clarity of its responses have been notably improved. This version is the 32B version.
alibaba
Qwen2.5 VL 3B Instruct
An open-source model in the Qwen2.5-VL series, with improved instruction following, mathematics, problem-solving, and coding capabilities, and enhanced general object recognition. It supports precise localization of visual elements across diverse formats, understands videos up to one hour long with second-level event localization, and can reason about temporal sequence and speed. Its parsing and grounding capabilities allow it to operate OS or mobile agents, and it offers strong key-information extraction and JSON-format output. This 3B version is suitable for mobile-device use.
alibaba
Qwen2.5 VL 72B Instruct
An open-source model in the Qwen2.5-VL series, with improved instruction following, mathematics, problem-solving, and coding capabilities, and enhanced general object recognition. It supports precise localization of visual elements across diverse formats, understands videos up to one hour long with second-level event localization, and can reason about temporal sequence and speed. Its parsing and grounding capabilities allow it to control OS or mobile agents, and it offers strong key-information extraction and JSON-format output. This 72B version is the most powerful in the series.
alibaba
Qwen2.5 VL 7B Instruct
An open-source model in the Qwen2.5-VL series, with enhanced instruction following, mathematics, problem-solving, and coding capabilities, and improved general object recognition. It supports precise localization of visual elements across diverse formats, understands videos up to one hour long with second-level event localization, and can reason about temporal sequence and speed. Its parsing and grounding capabilities allow it to control OS or mobile agents, and it offers strong key-information extraction and JSON-format output. This 7B version offers a relatively balanced trade-off between compute and performance.
alibaba
Qwen3 0.6B
The Qwen3 hybrid reasoning model supports seamless switching between thinking and non-thinking modes during conversations. It outperforms the Qwen2.5 small-scale series in general capabilities.
alibaba
Qwen3 1.7B
The Qwen3 hybrid reasoning model supports seamless switching between thinking and non-thinking modes during conversations. It outperforms the Qwen2.5 small-scale series in general capabilities, with stronger human preference alignment and notable gains in creative writing, role-playing, multi-turn dialogue, and instruction following—resulting in a significantly improved user experience.
alibaba
Qwen3 14B
Qwen3 hybrid reasoning model enables seamless switching between thinking and non-thinking modes during conversations. It achieves SOTA reasoning performance at its scale and significantly outperforms Qwen2.5-14B in general capabilities.
alibaba
Qwen3 235B
A powerful general-purpose model perfect for complex reasoning tasks, instruction following, and multilingual projects. With 235 billion parameters, this model excels at understanding nuanced instructions and maintaining context across long conversations. Ideal for research, creative writing, and sophisticated problem-solving where you need reliable, well-reasoned responses.
alibaba
Qwen3 235B A22B
Qwen3 hybrid reasoning model enables seamless switching between thinking and non-thinking modes during conversations. It delivers strong reasoning performance with fewer parameters, comparable to QwQ, and significantly outperforms Qwen2.5-72B-Instruct in general capabilities, achieving state-of-the-art (SOTA) results for its scale.
alibaba
Qwen3 235B A22B Instruct
Compared to its predecessor (Qwen3-235B-A22B), the latest open-source Qwen3 model (non-thinking mode) delivers modest improvements in creative performance and model safety.
alibaba
Qwen3 235B A22B Instruct 2507
Compared to its predecessor (Qwen3-235B-A22B), the latest open-source Qwen3 model (non-thinking mode) delivers modest improvements in creative performance and model safety.
alibaba
Qwen3 235B A22B Thinking
Built upon the Qwen3 framework, this open-source reasoning model offers substantial improvements over its predecessor (Qwen3-235B-A22B) in logic, general capabilities, knowledge, and creativity, making it ideal for highly complex, reasoning-intensive scenarios.
alibaba
Qwen3 235B A22B Thinking 2507
Built upon the Qwen3 framework, this open-source reasoning model offers substantial improvements over its predecessor (Qwen3-235B-A22B) in logic, general capabilities, knowledge, and creativity, making it ideal for highly complex, reasoning-intensive scenarios.
alibaba
Qwen3 30B A3B
Qwen3 hybrid reasoning model enables seamless switching between thinking and non-thinking modes during conversations. It delivers strong reasoning performance with fewer parameters, comparable to QwQ-32B, and significantly outperforms Qwen2.5-14B in general capabilities, achieving state-of-the-art (SOTA) results for its scale.
alibaba
Qwen3 30B A3B Instruct
Qwen3 open-source non-thinking model. As an advanced successor to Qwen3-30B-A3B, this model delivers substantial improvements in overall general capabilities across Chinese, English, and multiple languages. Furthermore, it has been specifically optimized for subjective, open-ended tasks, providing responses that better align with user preferences and offer significantly greater helpfulness.
alibaba
Qwen3 30B A3B Instruct 2507
Qwen3 open-source non-thinking model. As an advanced successor to Qwen3-30B-A3B, this model delivers substantial improvements in overall general capabilities across Chinese, English, and multiple languages. Furthermore, it has been specifically optimized for subjective, open-ended tasks, providing responses that better align with user preferences and offer significantly greater helpfulness.
alibaba
Qwen3 30B A3B Thinking
Qwen3 Open-Source Reasoning Model. As an advanced successor to Qwen3-30B-A3B, this model features superior complex reasoning, excelling in challenging tasks such as logic, mathematics, science, and coding. Additionally, it demonstrates significant improvements in core capabilities, including instruction following, text comprehension, and multilingual translation.
alibaba
Qwen3 30B A3B Thinking 2507
Qwen3 Open-Source Reasoning Model. As an advanced successor to Qwen3-30B-A3B, this model features superior complex reasoning, excelling in challenging tasks such as logic, mathematics, science, and coding. Additionally, it demonstrates significant improvements in core capabilities, including instruction following, text comprehension, and multilingual translation.
alibaba
Qwen3 32B
Qwen3 hybrid reasoning model enables seamless switching between thinking and non-thinking modes during conversations. It delivers strong reasoning performance with fewer parameters, comparable to QwQ, and significantly outperforms Qwen2.5-32B-Instruct in general capabilities, achieving state-of-the-art (SOTA) results for its scale.
alibaba
Qwen3 4B
This Qwen3 hybrid reasoning model enables seamless switching between thinking and non-thinking modes during conversations, achieving SOTA reasoning performance at its scale. It shows significant improvements in human preference alignment, creative writing, role-playing, multi-turn dialogue, and instruction following, delivering a greatly enhanced user experience.
alibaba
Qwen3 8B
Qwen3 hybrid reasoning model enables seamless switching between thinking and non-thinking modes during conversations. It achieves SOTA reasoning performance at its scale and significantly outperforms Qwen2.5-7B in general capabilities.
alibaba
Qwen3 Coder 30B A3B
Your budget-friendly coding companion designed for everyday development work. This model specializes in code generation, debugging, and refactoring with strong tool integration. With a massive 262K context window, it can handle entire codebases and long documentation. Perfect for daily coding tasks where you need fast, reliable assistance without breaking the bank.
alibaba
Qwen3 Coder 30B A3B Instruct
The Qwen3-based code generation model, inheriting the coding agent capabilities of Qwen3-Coder-480B-A35B-Instruct, achieves State-of-the-Art (SOTA) coding performance among models of comparable size.
alibaba
Qwen3 Coder 480B A35B
Qwen3-Coder-480B-A35B-Instruct is a Mixture-of-Experts (MoE) code generation model developed by the Qwen team. It is optimized for agentic coding tasks such as function calling, tool use, and long-context reasoning over repositories. The model features 480 billion total parameters, with 35 billion active per forward pass (8 out of 160 experts). Pricing for the Alibaba endpoints varies by context length; once a request exceeds 128K input tokens, the higher pricing tier applies.
alibaba
Qwen3 Coder 480B A35B Instruct
Powered by Qwen3, this code generation model is a powerful Coding Agent, achieving state-of-the-art (SOTA) performance among open-source models.
alibaba
Qwen3 Coder Flash
Based on Qwen3, this code generation model inherits the coding agent capabilities of Qwen3-Coder-Plus and supports multi-turn tool interaction. It features focused optimizations on repository-level understanding and enhanced tool-calling stability. This version is a snapshot dated July 28, 2025.
alibaba
Qwen3 Coder Next
Qwen3-Coder-Next is an open-weight causal language model optimized for coding agents and local development workflows. It uses a sparse MoE design with 80B total parameters and only 3B activated per token, delivering performance comparable to models with 10 to 20x higher active compute, which makes it well suited for cost-sensitive, always-on agent deployment. The model is trained with a strong agentic focus and performs reliably on long-horizon coding tasks, complex tool usage, and recovery from execution failures. With a native 256k context window, it integrates cleanly into real-world CLI and IDE environments and adapts well to common agent scaffolds used by modern coding tools. The model operates exclusively in non-thinking mode and does not emit <think> blocks, simplifying integration for production coding agents.
alibaba
Qwen3 Coder Plus
Qwen3-based code generation model with strong coding-agent power. It excels at tool calling and environment interaction and is capable of autonomous programming, with outstanding code capability while maintaining general ability. This is a snapshot from September 23, 2025. Compared to the previous version (snapshot from July 22), it demonstrates improved robustness in downstream task performance and tool invocation, along with enhanced code security.
alibaba
Qwen3 Max
Compared with the snapshot as of September 23, 2025, the Qwen3 series Max model in this release achieves an effective integration of thinking and non-thinking modes, resulting in a comprehensive and substantial improvement in the model's overall performance. In thinking mode, the model simultaneously supports web search, web information extraction, and a code interpreter tool, enabling it to tackle more complex and challenging problems with greater accuracy by leveraging external tools while engaging in slow, deliberative reasoning. This version is based on a snapshot taken on January 23, 2026.
alibaba
Qwen3 Next 80B A3B Instruct
A new generation of open-source, non-thinking mode model powered by Qwen3. This version demonstrates superior Chinese text understanding, augmented logical reasoning, and enhanced capabilities in text generation tasks over the previous iteration (Qwen3-235B-A22B-Instruct-2507).
alibaba
Qwen3 Next 80B A3B Thinking
A new generation of Qwen3-based open-source thinking mode models. This version offers improved instruction following and streamlined summary responses over the previous iteration (Qwen3-235B-A22B-Thinking-2507).
alibaba
Qwen3 VL 235B A22B Instruct
The Qwen3 series VL models have been comprehensively upgraded in areas such as visual coding and spatial perception. Visual perception and recognition capabilities have significantly improved, understanding of ultra-long videos is supported, and OCR functionality has undergone a major enhancement.
alibaba
Qwen3 VL 235B A22B Thinking
Qwen3 series VL models feature significantly enhanced multimodal reasoning capabilities, with a particular focus on optimizing the model for STEM and mathematical reasoning. Visual perception and recognition abilities have been comprehensively improved, and OCR capabilities have undergone a major upgrade.
alibaba
Qwen3 VL 30B A3B Instruct
Qwen3-VL's second-largest MoE model delivers fast responses and supports ultra-long contexts (e.g., long videos and documents). It enhances image/video understanding, spatial perception, and object recognition, and includes 2D/3D visual localization to handle complex real-world tasks.
alibaba
Qwen3 VL 30B A3B Thinking
The "Thinking" edition of Qwen3-VL's second-largest MoE model offers fast response, enhanced multimodal understanding and reasoning, visual agent capabilities, and ultra-long context support (e.g., long videos and documents). It improves image/video comprehension, spatial perception, and object recognition to handle complex real-world tasks.
alibaba
Qwen3 VL 32B Instruct
The largest dense model in the Qwen3-VL series; its non-thinking version delivers overall performance second only to Qwen3-VL-235B-Instruct. It excels in document recognition and comprehension, demonstrates strong spatial awareness and object identification, and achieves state-of-the-art performance in 2D visual detection and spatial reasoning. It is well suited to complex perception tasks across a wide range of general-purpose scenarios.
alibaba
Qwen3 VL 32B Thinking
The largest dense model in the Qwen3-VL series, its reasoning version boasts multimodal reasoning capabilities second only to Qwen3-VL-235B-Thinking. It excels in STEM and math problem-solving, general image and video understanding, and achieves state-of-the-art performance in multimodal agent capabilities, making it ideal for complex multimodal reasoning tasks.
alibaba
Qwen3 VL 8B Instruct
Qwen3-VL 8B Dense model has a reduced memory footprint and delivers comprehensive improvements in image/video understanding, ultra-long context support (e.g., long videos and documents), spatial perception, and object recognition, enabling it to handle complex real-world tasks.
alibaba
Qwen3 VL 8B Thinking
The "Thinking" edition of Qwen3-VL 8B Dense has a reduced memory footprint, enabling multimodal understanding and reasoning. It supports ultra-long contexts (e.g., long videos and documents), 2D/3D visual localization, and enhances image/video comprehension, spatial perception, and object recognition.
alibaba
Qwen3 VL Flash
The Qwen3 series of small-sized visual understanding models effectively integrates thinking and non-thinking modes. Compared with the snapshot taken on October 15, 2025, the overall performance of the model has improved significantly: it delivers enhanced capabilities in general visual recognition and reasoning, and shows marked improvements in recognition accuracy across various business scenarios such as security, in-store inspections, equipment monitoring, and photo-based problem solving. This version is a snapshot as of January 22, 2026.
alibaba
Qwen3 VL Plus
The Qwen3 series of visual understanding models effectively integrates thinking and non-thinking modes. Compared to the snapshot released on September 23, this version delivers superior performance in reasoning and analysis tasks as well as style control, while also offering lower latency and faster response speeds. This version is based on a snapshot taken on December 19, 2025.
alibaba
Qwen3 VL Thinking
Our premier vision-language model combining 235B parameters with exceptional visual understanding and reasoning. This model excels at analyzing images, charts, diagrams, and documents with a special focus on STEM and mathematical content. With dramatically improved OCR capabilities and visual perception, it handles everything from handwritten equations to complex technical diagrams. Perfect for research, education, and any task requiring sophisticated visual analysis.
alibaba
Qwen3.5 122B A10B
The Qwen3.5 122B-A10B native vision-language model is built on a hybrid architecture that integrates a linear attention mechanism with a sparse mixture-of-experts model, achieving higher inference efficiency. In terms of overall performance, this model is second only to Qwen3.5-397B-A17B. Its text capabilities significantly outperform those of Qwen3-235B-2507, and its visual capabilities surpass those of Qwen3-VL-235B.
alibaba
Qwen3.5 27B
The Qwen3.5 27B native vision-language Dense model incorporates a linear attention mechanism, delivering fast response times while balancing inference speed and performance. Its overall capabilities are comparable to those of the Qwen3.5-122B-A10B.
alibaba
Qwen3.5 35B A3B
The Qwen3.5 Series 35B-A3B is a native vision-language model designed with a hybrid architecture that integrates linear attention mechanisms and a sparse mixture-of-experts model, achieving higher inference efficiency. Its overall performance is comparable to that of the Qwen3.5-27B.
alibaba
Qwen3.5 397B A17B
The Qwen3.5 series 397B-A17B native vision-language model is built on a hybrid architecture that integrates a linear attention mechanism with a sparse mixture-of-experts model, achieving higher inference efficiency. It delivers state-of-the-art performance comparable to leading-edge models across a wide range of tasks, including language understanding, logical reasoning, code generation, agent-based tasks, image understanding, video understanding, and graphical user interface (GUI) interactions. With its robust code-generation and agent capabilities, the model exhibits strong generalization across diverse agent scenarios.
alibaba
Qwen3.5 Flash
The Qwen3.5 native vision-language Flash models are built on a hybrid architecture that integrates a linear attention mechanism with a sparse mixture-of-experts model, achieving higher inference efficiency. Compared to the 3 series, these models deliver a leap forward in performance for both pure text and multimodal tasks, offering fast response times while balancing inference speed and overall performance.
alibaba
Qwen3.5 Plus
The Qwen3.5 native vision-language series Plus models are built on a hybrid architecture that integrates linear attention mechanisms with sparse mixture-of-experts models, achieving higher inference efficiency. In a variety of task evaluations, the 3.5 series consistently demonstrates performance on par with state-of-the-art leading models. Compared to the 3 series, these models show a leap forward in both pure-text and multimodal capabilities.
alibaba
QwQ 32B
QwQ is the reasoning model of the Qwen series. Compared with conventional instruction-tuned models, QwQ's capacity for thinking and reasoning yields significantly better performance on downstream tasks, especially hard problems. QwQ-32B is the medium-sized reasoning model, achieving competitive performance against state-of-the-art reasoning models such as DeepSeek-R1 and o1-mini.
deepseek
R1 Distill Llama 70B
DeepSeek R1 Distill Llama 70B is a large language model distilled from Llama-3.3-70B-Instruct using outputs from DeepSeek R1. Advanced distillation techniques give it high performance across multiple benchmarks, including AIME 2024 pass@1: 70.0, MATH-500 pass@1: 94.5, and a CodeForces rating of 1633. Fine-tuning on DeepSeek R1 outputs enables performance competitive with larger frontier models.
deepseek
R1 Distill Qwen 32B
DeepSeek R1 Distill Qwen 32B is a distilled large language model based on Qwen 2.5 32B, using outputs from DeepSeek R1. It outperforms OpenAI's o1-mini across various benchmarks, achieving new state-of-the-art results for dense models. Benchmark results include AIME 2024 pass@1: 72.6, MATH-500 pass@1: 94.3, and CodeForces Rating: 1691. The model leverages fine-tuning from DeepSeek R1 outputs, enabling competitive performance comparable to larger frontier models.
tngtech
R1T Chimera
TNG-R1T-Chimera is an experimental LLM with a penchant for creative storytelling and character interaction. It is a derivative of the original TNG/DeepSeek-R1T-Chimera released in April 2025. Characteristics and improvements include: a creative and pleasant personality; a preliminary EQ-Bench3 score of about 1305; noticeably higher intelligence than the original, albeit slightly slower; much more consistent think-token usage, with properly delineated reasoning and answer blocks; and much improved tool calling.
relace
Relace Apply 3
Relace Apply 3 is a specialized code-patching LLM that merges AI-suggested edits straight into your source files. It can apply updates from GPT-4o, Claude, and other models at an average of 10,000 tokens/sec. The model requires the prompt in the following format: <instruction>{instruction}</instruction> <code>{initial_code}</code> <update>{edit_snippet}</update>. Zero Data Retention is enabled for Relace.
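The tag-based prompt format above can be assembled in a few lines. This is a minimal sketch: the `<instruction>`/`<code>`/`<update>` tags come straight from the model description, but the helper name and the choice of newlines between tags are illustrative, and actually sending the prompt (endpoint, model id, auth) depends on your provider and is not shown.

```python
def build_apply_prompt(instruction: str, initial_code: str, edit_snippet: str) -> str:
    """Assemble a prompt in the tag format Relace Apply 3 expects.

    The <instruction>/<code>/<update> tags are from the model's
    documented format; the newline separators are an assumption.
    """
    return (
        f"<instruction>{instruction}</instruction>\n"
        f"<code>{initial_code}</code>\n"
        f"<update>{edit_snippet}</update>"
    )

# Hypothetical example: asking the model to merge a docstring edit.
prompt = build_apply_prompt(
    "Add a docstring to the function.",
    "def add(a, b):\n    return a + b",
    'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b',
)
```

The resulting string would then be sent as the user message in an ordinary chat-completion request to whichever endpoint serves the model.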
relace
Relace Search
The relace-search model uses 4-12 `view_file` and `grep` tools in parallel to explore a codebase and return relevant files to the user request. In contrast to RAG, relace-search performs agentic multi-step reasoning to produce highly precise results 4x faster than any frontier model. It's designed to serve as a subagent that passes its findings to an "oracle" coding agent, who orchestrates/performs the rest of the coding task. To use relace-search you need to build an appropriate agent harness, and parse the response for relevant information to hand off to the oracle.
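The parse-and-hand-off step described above might look like the following sketch. The response format (one file path per line, prefixed with `- `) and both function names are pure assumptions for illustration; parse whatever structure the API actually returns in your harness.

```python
import re


def extract_files(search_response: str) -> list[str]:
    """Extract file paths from a relace-search response.

    The one-path-per-line, '- '-prefixed format assumed here is
    illustrative only; adapt the parsing to the real API output.
    """
    return re.findall(r"^- (\S+)$", search_response, flags=re.MULTILINE)


def oracle_handoff(user_request: str, files: list[str]) -> str:
    """Build the message handing the findings to the oracle coding agent."""
    return (
        f"User request:\n{user_request}\n\n"
        "Relevant files (from relace-search):\n" + "\n".join(files)
    )


# Hypothetical response and handoff.
response = "- src/auth/login.py\n- src/auth/session.py"
handoff = oracle_handoff("Fix the session timeout bug", extract_files(response))
```

The oracle agent then receives `handoff` as context and performs the actual code changes.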
undi95
ReMM SLERP 13B
A recreation trial of the original MythoMax-L2-13B, but with updated models. #merge
essentialai
Rnj 1 Instruct
Rnj-1 is an 8B-parameter, dense, open-weight model family developed by Essential AI and trained from scratch with a focus on programming, math, and scientific reasoning. The model demonstrates strong performance across multiple programming languages, tool-use workflows, and agentic execution environments (e.g., mini-SWE-agent).
thedrummer
Rocinante 12B
Rocinante 12B is designed for engaging storytelling and rich prose. Early testers have reported expanded vocabulary with unique and expressive word choices, enhanced creativity for vivid narratives, and adventure-filled and captivating stories.
bytedance
Seed 1.6
Seed 1.6 is a general-purpose model released by the ByteDance Seed team. It incorporates multimodal capabilities and adaptive deep thinking with a 256K context window.
bytedance
Seed 1.6 Flash
Seed 1.6 Flash is an ultra-fast multimodal deep thinking model by ByteDance Seed, supporting both text and visual understanding. It features a 256k context window and can generate outputs of up to 16k tokens.
bytedance
Seed 2.0 Mini
Seed-2.0-mini targets latency-sensitive, high-concurrency, and cost-sensitive scenarios, emphasizing fast response and flexible inference deployment. It delivers performance comparable to ByteDance-Seed-1.6, supports 256k context, four reasoning effort modes (minimal/low/medium/high), multimodal understanding, and is optimized for lightweight tasks where cost and speed take priority.
thedrummer
Skyfall 36B V2
Skyfall 36B v2 is an enhanced iteration of Mistral Small 2501, specifically fine-tuned for improved creativity, nuanced writing, role-playing, and coherent storytelling.
upstage
Solar Pro 3
Solar Pro 3 is Upstage's powerful Mixture-of-Experts (MoE) language model. With 102B total parameters and 12B active parameters per forward pass, it delivers exceptional performance while maintaining computational efficiency. Optimized for Korean with English and Japanese support.
perplexity
Sonar
Sonar is lightweight, affordable, fast, and simple to use — now featuring citations and the ability to customize sources. It is designed for companies seeking to integrate lightweight question-and-answer features optimized for speed.
perplexity
Sonar Deep Research
Sonar Deep Research is a research-focused model designed for multi-step retrieval, synthesis, and reasoning across complex topics. It autonomously searches, reads, and evaluates sources, refining its approach as it gathers information. This enables comprehensive report generation across domains like finance, technology, health, and current events. Notes on pricing: input tokens consist of prompt tokens (the user prompt) plus citation tokens (tokens processed from running searches). Deep Research runs multiple searches to conduct exhaustive research; searches are priced at $5 per 1,000 searches, so a request that performs 30 searches costs $0.15 in this step. Reasoning is a distinct step in Deep Research, since the model reasons extensively through all the material it gathers during its research phase. These reasoning tokens differ from the CoT tokens in the answer: they are used to reason through the research material before generating the output. Reasoning tokens are priced at $3 per 1M tokens.
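The pricing arithmetic above (search and reasoning components only) can be sketched as follows. The rates are taken from the notes; the function name and rounding are illustrative, and ordinary input/output token charges are deliberately excluded.

```python
SEARCH_PRICE_PER_1000 = 5.00   # $5 per 1,000 searches (from the pricing notes)
REASONING_PRICE_PER_1M = 3.00  # $3 per 1M reasoning tokens


def deep_research_extra_cost(num_searches: int, reasoning_tokens: int) -> float:
    """Dollar cost of the search + reasoning steps of one request.

    Regular prompt/citation/output token charges are billed
    separately and are not included here.
    """
    search_cost = num_searches * SEARCH_PRICE_PER_1000 / 1000
    reasoning_cost = reasoning_tokens * REASONING_PRICE_PER_1M / 1_000_000
    return round(search_cost + reasoning_cost, 6)


# The 30-search example from the notes: $0.15 in search cost alone.
cost = deep_research_extra_cost(num_searches=30, reasoning_tokens=0)
```

For instance, the 30-search request from the notes with one million reasoning tokens would add $3.00 on top of the $0.15 search cost.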
perplexity
Sonar Pro
Sonar Pro pricing includes Perplexity search pricing. For enterprises seeking more advanced capabilities, the Sonar Pro API can handle in-depth, multi-step queries with added extensibility, such as roughly double the number of citations per search compared to Sonar on average. Plus, with a larger context window, it can handle longer and more nuanced searches and follow-up questions.
perplexity
Sonar Pro Search
Sonar Pro Search is Perplexity's most advanced agentic search system, designed for deeper reasoning and analysis. Pricing is based on tokens plus $18 per thousand requests. This model powers the Pro Search mode on the Perplexity platform. Sonar Pro Search adds autonomous, multi-step reasoning to Sonar Pro: instead of a single query plus synthesis, it plans and executes entire research workflows using tools.
perplexity
Sonar Reasoning Pro
Sonar Reasoning Pro pricing includes Perplexity search pricing. Sonar Reasoning Pro is a premier reasoning model powered by DeepSeek R1 with Chain of Thought (CoT). Designed for advanced use cases, it supports in-depth, multi-step queries with a larger context window and can surface more citations per search, enabling more comprehensive and extensible responses.
raifle
SorcererLM 8x22B
SorcererLM is an advanced RP and storytelling model, built as a Low-rank 16-bit LoRA fine-tuned on WizardLM-2 8x22B. Features advanced reasoning and emotional intelligence for engaging and immersive interactions, vivid writing capabilities enriched with spatial and contextual awareness, and enhanced narrative depth promoting creative and dynamic storytelling.
arcee
Spotlight
Spotlight is a 7-billion-parameter vision-language model derived from Qwen 2.5-VL and fine-tuned by Arcee AI for tight image-text grounding tasks. It offers a 32K-token context window, enabling rich multimodal conversations that combine lengthy documents with one or more images. Training emphasized fast inference on consumer GPUs while retaining strong captioning, visual-question-answering, and diagram-analysis accuracy. As a result, Spotlight slots neatly into agent workflows where screenshots, charts, or UI mock-ups need to be interpreted on the fly. Early benchmarks show it matching or outscoring larger VLMs such as LLaVA-1.6 13B on popular VQA and POPE alignment tests.
stepfun
Step 3.5 Flash
Step 3.5 Flash is StepFun's most capable open-source foundation model. Built on a sparse Mixture of Experts (MoE) architecture, it selectively activates only 11B of its 196B parameters per token. It is a reasoning model that remains remarkably fast and efficient even at long contexts.
alibaba
Tongyi DeepResearch 30B A3B
Tongyi DeepResearch is an agentic large language model developed by Tongyi Lab, with 30 billion total parameters of which only 3 billion are activated per token. It is optimized for long-horizon, deep information-seeking tasks and delivers state-of-the-art performance on benchmarks like Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, GAIA, xbench-DeepSearch, and FRAMES, making it well suited to complex agentic search, reasoning, and multi-step problem-solving compared to prior models. The model includes a fully automated synthetic data pipeline for scalable pre-training, fine-tuning, and reinforcement learning, and uses large-scale continual pre-training on diverse agentic data to strengthen reasoning and keep knowledge current. It also features end-to-end on-policy RL with a customized Group Relative Policy Optimization, including token-level gradients and negative-sample filtering for stable training. The model supports ReAct for core-ability checks and an IterResearch-based 'Heavy' mode for maximum performance through test-time scaling. It is ideal for advanced research agents, tool use, and heavy inference workflows.
arcee
Trinity Large
Trinity Large (Preview) is a 400B-parameter (13B active) sparse mixture-of-experts language model, engineered to scale model capacity while maintaining inference efficiency over long contexts. It delivers strong performance in reasoning-heavy workloads including math, coding-related tasks, and multi-step agent workflows. With a 131K context window and native function calling, it excels at complex tasks requiring deep understanding and structured outputs.
arcee
Trinity Mini
Trinity Mini is a 26B-parameter (3B active) sparse mixture-of-experts language model, engineered for efficient inference over long contexts with robust function calling and multi-step agent workflows. With 128K context, it delivers an outstanding price-to-performance ratio while maintaining coherent multi-turn reasoning and reliable tool use. Ideal for production deployments where speed and cost efficiency are paramount.
bytedance
UI TARS 7B
UI-TARS-1.5 is a multimodal vision-language agent optimized for GUI-based environments, including desktop interfaces, web browsers, mobile systems, and games. Built by ByteDance, it extends the UI-TARS framework with reinforcement learning-based reasoning, enabling robust action planning and execution across virtual interfaces. The model achieves state-of-the-art results on a range of interactive and grounding benchmarks, including OSWorld, WebVoyager, AndroidWorld, and ScreenSpot. It also demonstrates perfect task completion across diverse Poki games and outperforms prior models in Minecraft agent tasks. UI-TARS-1.5 supports thought decomposition during inference and shows strong scaling across variants, with the 1.5 version notably exceeding the performance of earlier 72B and 7B checkpoints.
venice
Uncensored
Venice Uncensored Dolphin Mistral 24B Venice Edition is a fine-tuned variant of Mistral-Small-24B-Instruct-2501, developed by dphn.ai in collaboration with Venice.ai. This model is designed as an “uncensored” instruct-tuned LLM, preserving user control over alignment, system prompts, and behavior. Intended for advanced and unrestricted use cases, Venice Uncensored emphasizes steerability and transparent behavior, removing default safety and alignment layers typically found in mainstream assistant models.
thedrummer
UnslopNemo 12B
UnslopNemo v4.1 is the latest addition from the creator of Rocinante, designed for adventure writing and role-play scenarios.
arcee
Virtuoso Large
Virtuoso-Large is Arcee's top-tier general-purpose LLM at 72B parameters, tuned to tackle cross-domain reasoning, creative writing, and enterprise QA. Unlike many 70B peers, it retains the 128K context inherited from Qwen 2.5, letting it ingest books, codebases, or financial filings wholesale. Training blended DeepSeek R1 distillation, multi-epoch supervised fine-tuning, and a final DPO/RLHF alignment stage, yielding strong performance on BIG-Bench-Hard, GSM-8K, and long-context Needle-In-Haystack tests. Enterprises use Virtuoso-Large as the "fallback" brain in Conductor pipelines when other SLMs flag low confidence. Despite its size, aggressive KV-cache optimizations keep first-token latency in the low-second range on 8× H100 nodes, making it a practical production-grade powerhouse.
mistral
Voxtral Mini
A mini audio understanding model released in July 2025.
mistral
Voxtral Small
A small audio understanding model released in July 2025.
mancer
Weaver
An attempt to recreate Claude-style verbosity, but don't expect the same level of coherence or memory. Meant for use in roleplay/narrative situations.
microsoft
WizardLM 2 8x22B
WizardLM-2 8x22B is Microsoft AI's most advanced Wizard model. It demonstrates highly competitive performance compared to leading proprietary models and consistently outperforms all existing state-of-the-art open-source models. It is an instruct fine-tune of Mixtral 8x22B. #moe