Signals
6/10 Model Release 12 May 2026, 15:01 UTC

Alibaba AIDC releases Ovis2.6-80B-A3B, an open-source multimodal MoE model with visual chain-of-thought.

Ovis2.6-80B-A3B hits a practical deployment sweet spot: 80B parameters of capacity with only 3B active per token at inference. Its "Think with Image" capability brings visual chain-of-thought reasoning to open-weight models, closing a gap with proprietary VLMs, and the Apache 2.0 license makes it immediately viable for commercial enterprise integration.

What Happened

Alibaba's AIDC team has open-sourced Ovis2.6-80B-A3B, a new multimodal Mixture-of-Experts (MoE) large language model. The release lands amid rapid progress in multimodal capabilities more broadly, including demonstrations of advanced single-prompt video generation with Google Gemini and predictions that such heavy models will run locally within the next year.

Technical Details

Ovis2.6-80B-A3B is built on an MoE architecture with 80 billion total parameters, of which only 3 billion are active per token during inference. It offers a 64K context window and supports high-resolution image inputs up to 2880x2880. The standout technical feature is "Think with Image," which folds visual reasoning into a chain-of-thought (CoT) framework. The model is released under the permissive Apache 2.0 license.
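To make the 80B-total / 3B-active split concrete, here is a back-of-envelope comparison of per-token decode compute and weight storage for a dense 80B model versus this MoE configuration. The 2-FLOPs-per-active-parameter rule of thumb and bf16 storage are illustrative assumptions, not published specs for Ovis2.6.

```python
# Rough cost model: MoE routing cuts per-token compute, but every
# expert's weights must still be stored, so weight memory is unchanged.

def decode_flops_per_token(active_params: int) -> int:
    # A decoder forward pass costs roughly 2 FLOPs per active parameter
    # per token (one multiply plus one add in each matmul).
    return 2 * active_params

def weight_memory_gb(total_params: int, bytes_per_param: int = 2) -> float:
    # bf16 = 2 bytes per parameter; all experts are resident.
    return total_params * bytes_per_param / 1e9

DENSE_TOTAL = int(80e9)
MOE_TOTAL, MOE_ACTIVE = int(80e9), int(3e9)

dense_flops = decode_flops_per_token(DENSE_TOTAL)
moe_flops = decode_flops_per_token(MOE_ACTIVE)

print(f"dense 80B : {dense_flops / 1e9:.0f} GFLOPs/token, "
      f"{weight_memory_gb(DENSE_TOTAL):.0f} GB weights")
print(f"MoE 3B act: {moe_flops / 1e9:.0f} GFLOPs/token, "
      f"{weight_memory_gb(MOE_TOTAL):.0f} GB weights")
print(f"compute ratio: {dense_flops / moe_flops:.1f}x")
```

Under these assumptions the MoE needs the same ~160 GB of weight storage but roughly 27x less compute per generated token, which is where the latency win comes from.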

Why It Matters

This release is significant for machine learning engineers looking to deploy advanced vision-language models (VLMs) in production. The 80B/3B MoE split lets teams tap the representational capacity of an 80B model at roughly the per-token compute and latency cost of a 3B dense model (the full weights still need to be stored, but inference compute drops sharply). High-resolution support (2880x2880) is a critical requirement for real-world enterprise use cases like dense document parsing, OCR, and UI understanding, where standard low-res models frequently hallucinate. Visual CoT brings advanced spatial and logical reasoning to the open-source ecosystem, challenging the dominance of closed-source models like GPT-4o and Gemini, and the Apache 2.0 license removes commercial friction, allowing immediate adoption.
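A quick sketch of why the 2880x2880 ceiling pairs well with the 64K context: the vision-token budget grows quadratically with image side length. The patch size (16) and 2x2 token merging below are illustrative assumptions, not Ovis2.6's published vision-encoder configuration.

```python
# Estimate how many LLM tokens a square image consumes, assuming a
# ViT-style encoder with 16x16 patches and 2x2 patch merging.

def vision_tokens(width: int, height: int,
                  patch: int = 16, merge: int = 2) -> int:
    # Each patch x patch tile becomes one patch embedding; a merge x merge
    # block of patches is then fused into a single LLM token.
    patches_w, patches_h = width // patch, height // patch
    return (patches_w // merge) * (patches_h // merge)

for side in (448, 1024, 2880):
    print(f"{side}x{side}: {vision_tokens(side, side):,} tokens")
```

Even a full 2880x2880 page costs on the order of 8K tokens under these assumptions, leaving most of the 64K context free for instructions, retrieved text, and the model's chain-of-thought.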

What to Watch Next

Monitor community benchmarks comparing Ovis2.6's visual CoT against proprietary models on complex tasks like chart reading, spatial reasoning, and medical imaging. Additionally, expect a continued industry-wide push to optimize these heavy multimodal capabilities for local, consumer-grade hardware (Apple Metal and consumer Nvidia GPUs) over the next 12-14 months.

multimodal moe alibaba open-source computer-vision