MiMo-V2.5-Pro: An Open-Source MoE Giant with 1.02T Parameters and 1M Context
MiMo-V2.5-Pro: Redefining Ultra-Scale Open-Source Models
MiMo-V2.5-Pro is a state-of-the-art open-source Mixture-of-Experts (MoE) language model. It has 1.02 trillion (1.02T) total parameters, of which 42 billion (42B), about 4% of the total, are active per token. Designed for the most demanding agentic tasks, complex software engineering, and long-horizon reasoning, it supports a context window of up to 1 million (1M) tokens.
Key Technical Breakthroughs
The exceptional performance of MiMo-V2.5-Pro is driven by deep architectural optimizations:
1. Hybrid Attention Architecture
To reduce the quadratic cost of attention over long contexts, MiMo-V2.5-Pro interleaves Sliding Window Attention (SWA) and Global Attention (GA) layers at a 6:1 ratio.
- KV-Cache Optimization: Because only one layer in seven caches the full context, this design reduces KV-cache storage requirements by nearly 7x at long context lengths.
- Context Retention: By utilizing a learnable attention sink bias, the model maintains superior long-context performance while drastically reducing memory footprint.
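The "nearly 7x" figure can be checked by counting cached positions per layer. The sketch below is a back-of-the-envelope estimate, not the model's actual memory accounting: it assumes every layer stores the same number of bytes per cached token, and the layer count (70) and window size (128 tokens) are hypothetical values chosen only for illustration.

```python
def cached_positions(context_len: int, num_layers: int, window: int,
                     swa_per_ga: int = 6) -> int:
    """Total cached token positions summed over all layers of a hybrid stack.

    Assumes the stack repeats groups of `swa_per_ga` sliding-window layers
    followed by one global layer (the 6:1 interleaving described above).
    """
    group = swa_per_ga + 1
    if num_layers % group != 0:
        raise ValueError("num_layers must be a multiple of the group size")
    ga_layers = num_layers // group
    swa_layers = num_layers - ga_layers
    # A global layer caches every position; an SWA layer caches at most `window`.
    return ga_layers * context_len + swa_layers * min(window, context_len)


def kv_reduction(context_len: int, num_layers: int = 70, window: int = 128) -> float:
    """Ratio of an all-global cache to the hybrid cache (higher is better)."""
    full = num_layers * context_len
    return full / cached_positions(context_len, num_layers, window)


if __name__ == "__main__":
    for n in (4_096, 131_072, 1_000_000):
        print(f"{n:>9} tokens: {kv_reduction(n):.2f}x smaller cache")
```

Under these assumptions the ratio approaches 7 as the context grows but never reaches it, because the SWA layers still hold one window each.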
2. Multi-Token Prediction (MTP)
The model integrates an MTP module consisting of three lightweight dense FFN layers.
- Inference Acceleration: During inference, MTP can predict multiple tokens simultaneously, increasing output speed by nearly 3x.
- Training Efficiency: This module also significantly accelerates the rollout phase in Reinforcement Learning (RL) training.
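One way to see where a speedup of roughly 3x can come from is to treat MTP as the draft mechanism in speculative decoding. The sketch below is an illustrative model, not the released inference code: it assumes three draft tokens per step (one per MTP layer) and an independent per-token acceptance probability, and the acceptance rates used are hypothetical rather than measured.

```python
def expected_tokens_per_step(accept_prob: float, num_draft: int = 3) -> float:
    """Expected tokens emitted per verification pass of the main model.

    The main model always contributes one token. Draft token i is kept only
    if every earlier draft was accepted, which happens with probability
    accept_prob ** i under the independence assumption.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    return sum(accept_prob ** i for i in range(num_draft + 1))


if __name__ == "__main__":
    for a in (0.5, 0.8, 0.9):  # hypothetical acceptance rates
        print(f"accept_prob={a}: {expected_tokens_per_step(a):.3f} tokens/step")
```

Tokens per step is an upper bound on wall-clock speedup, since the estimate ignores the cost of running the MTP layers themselves.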
3. 1 Million Token Ultra-Long Context
On OpenAI's GraphWalks benchmark, MiMo-V2.5-Pro shows a large improvement in long-context reasoning over its predecessor. While V2 Pro degraded rapidly at 1M tokens, V2.5 Pro remains stable at both 512k and 1M tokens (scoring 0.37 on BFS and 0.62 on Parents at 1M), a significant milestone in long-horizon coherence.
Training Process: A Three-Stage Evolution
MiMo-V2.5-Pro follows a rigorous post-training paradigm:
- Supervised Fine-Tuning (SFT): Building foundational instruction-following skills using curated high-quality data pairs.
- Domain-Specialized Training: Optimizing for specific domains such as mathematics, safety, and complex agentic tool-use via specialized RL rewards.
- Multi-Teacher On-Policy Distillation (MOPD): Using dynamic on-policy RL, the student model learns iteratively from its own outputs under token-level guidance from expert teachers, integrating their diverse capabilities into a single model.
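Token-level guidance in on-policy distillation is commonly implemented as a per-token divergence between the student's and a teacher's next-token distributions, evaluated on sequences the student sampled itself. The sketch below assumes a reverse KL objective and one teacher per sequence; the source does not state the exact loss MiMo-V2.5-Pro uses, so the function names and the choice of divergence are assumptions.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits: Sequence[float],
               teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) for a single token position."""
    ls = log_softmax(student_logits)
    lt = log_softmax(teacher_logits)
    return sum(math.exp(s) * (s - t) for s, t in zip(ls, lt))


def distillation_loss(student_steps: Sequence[Sequence[float]],
                      teacher_steps: Sequence[Sequence[float]]) -> float:
    """Mean per-token reverse KL over one student-sampled sequence.

    Both arguments hold one logit vector per generated position; the teacher
    logits are assumed to be computed on the student's own sampled prefix.
    """
    if len(student_steps) != len(teacher_steps):
        raise ValueError("student and teacher must cover the same positions")
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)
```

With several teachers, one simple option (again an assumption) is to route each training prompt to the teacher for its domain and apply the same loss.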
Performance Benchmarks
MiMo-V2.5-Pro leads across multiple authoritative benchmarks:
- General Intelligence: 92.8 on MMLU-Redux and 88.4 on BBH.
- Math & Code: 99.6 on GSM8K and 75.6 on HumanEval+.
- Chinese Language: 91.5 on C-Eval and 90.2 on CMMLU.
- Software Engineering: 35.7 on SWE-Bench (AgentLess), indicating its potential for professional-grade software development.
Deployment & Usage
We recommend deploying MiMo-V2.5-Pro with SGLang or vLLM. Thanks to FP8 (E4M3) mixed-precision training, the model reduces VRAM usage without sacrificing accuracy.
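E4M3 is an 8-bit floating-point format with 1 sign bit, 4 exponent bits, and 3 mantissa bits, so each value takes half the memory of a 16-bit one. The decoder below follows the widely used OCP "FN" variant (exponent bias 7, no infinities, one NaN pattern per sign); whether MiMo-V2.5-Pro uses exactly this variant is an assumption, since the source only names E4M3.

```python
def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 (FN variant) bit pattern into a Python float."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")  # the only NaN encoding; there are no infinities
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias = -6.
        return sign * (mantissa / 8) * 2.0 ** -6
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)


if __name__ == "__main__":
    print(decode_e4m3(0x7E))  # largest finite value
    print(decode_e4m3(0x01))  # smallest positive subnormal
```

The narrow range (largest finite magnitude 448) is why FP8 training pipelines typically pair the format with per-tensor or per-block scaling factors.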
For local deployment, the recommended sampling parameters are: temperature=1.0, top_p=0.95.
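To make the two parameters concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-p (nucleus) sampling. It illustrates the general technique only; the function names are hypothetical, and serving frameworks implement this on the GPU rather than as shown here.

```python
import math
import random
from typing import List, Optional, Sequence


def nucleus(probs: Sequence[float], top_p: float) -> List[int]:
    """Indices of the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


def sample_top_p(logits: Sequence[float], temperature: float = 1.0,
                 top_p: float = 0.95,
                 rng: Optional[random.Random] = None) -> int:
    """Sample one token id using the recommended temperature and top_p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    kept = nucleus(probs, top_p)
    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

When serving through vLLM or SGLang, these values are normally passed as the `temperature` and `top_p` fields of the request rather than implemented by hand.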
Resources:
- HuggingFace
- ModelScope