Gate News, April 24: Luo Fuli, who leads Xiaomi's large language model team, said in an in-depth interview that the MiMo-V2-Pro model has 1 trillion total parameters and required thousands of GPUs to train. She said the 1T scale is the minimum needed to approach Claude Opus 4.6-level performance and to secure an entry ticket to the next phase of competition in AI agents.
Technically, the Pro version uses a highly sparse hybrid attention design, with a 7:1 ratio of sliding-window attention to global attention, to control inference costs for long-context processing. The model also retains the MTP (Multi-Token Prediction) architecture, which puts surplus compute to use for faster inference.
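
To make the attention ratio concrete, here is a minimal Python sketch of how such a hybrid layout could be arranged. It assumes that 7:1 means seven sliding-window layers for every global layer, since that is the direction that lowers cost. The layer count, window size, and the function names (`layer_schedule`, `attention_mask`, `attended_pairs`) are hypothetical and do not come from the interview or from Xiaomi's code.

```python
from typing import List


def layer_schedule(num_layers: int, swa_per_global: int = 7) -> List[str]:
    """Label each layer 'swa' or 'global'.

    Assumed pattern: swa_per_global sliding-window layers, then one global layer.
    """
    period = swa_per_global + 1
    return ["global" if (i + 1) % period == 0 else "swa" for i in range(num_layers)]


def attention_mask(seq_len: int, kind: str, window: int) -> List[List[bool]]:
    """Causal mask where mask[q][k] is True if query q may attend to key k."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q            # never look at future tokens
            in_window = q - k < window  # only the most recent `window` tokens
            row.append(causal and (kind == "global" or in_window))
        mask.append(row)
    return mask


def attended_pairs(seq_len: int, kind: str, window: int) -> int:
    """Count query-key pairs one layer scores, a rough proxy for attention cost."""
    if kind == "global":
        return seq_len * (seq_len + 1) // 2
    return sum(min(q + 1, window) for q in range(seq_len))


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    layers, seq_len, window = 16, 4096, 128
    schedule = layer_schedule(layers)
    hybrid = sum(attended_pairs(seq_len, kind, window) for kind in schedule)
    dense = layers * attended_pairs(seq_len, "global", window)
    print(schedule)
    print(f"hybrid / dense attention cost: {hybrid / dense:.3f}")
```

A global layer's cost grows quadratically with sequence length, while a sliding-window layer's cost grows linearly once the sequence exceeds the window. That is why making most layers sliding-window keeps long-context inference cheaper in this sketch.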
On the management side, only 30 to 40 members of the 100-person MiMo team work directly on core model iterations. The team operates without formal hierarchies, explicit sub-group divisions, or delivery deadlines. When it encounters numerical instability such as training loss spikes, the team prioritizes halting training to investigate, even if that means pausing for one or two weeks and incurring millions of dollars in compute costs.