
MiMo-V2-Flash Technical Report

Xiaomi LLM-Core Team
Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, Gang Xie, Hailin Zhang, Hanglong Lv, Hanyu Li, Heyu Chen, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Lei Li, Liang Zhao, Linghao Zhang, Peidian Li, Qianli Chen, Shaohui Liu, Shihua Yu, Shijie Cao, Shimao Chen, Shouqiu Yu, Shuo Liu, Tianling Zhou, Weijiang Su, Weikun Wang, Wenhan Ma, Xiangwei Deng, Bohan Mao, Bowen Ye, Can Cai, Chenghua Wang, Chengxuan Zhu, Chong Ma, Chun Chen, Chunan Li, Dawei Zhu, Deshan Xiao, Dong Zhang, Duo Zhang, Fangyue Liu, Feiyu Yang, Fengyuan Shi, Guoan Wang, Hao Tian, Hao Wu, Heng Qu, Hongfei Yi, Hongxu An, Hongyi Guan, Xing Zhang, Yifan Song, Yihan Yan, Yihao Zhao, Yingchun Lai, Yizhao Gao, Yu Cheng, Yuanyuan Tian, Yudong Wang, Zhen Tang, Zhengju Tang, Zhengtao Wen, Zhichao Song, Zhixian Zheng, Zihan Jiang, Jian Wen, Jiarui Sun, Jiawei Li, Jinlong Xue, Jun Xia, Kai Fang, Menghang Zhu, Nuo Chen, Qian Tu, Qihao Zhang, Qiying Wang, Rang Li, Rui Ma, Shaolei Zhang, Shengfan Wang, Shicheng Li, Shuhao Gu, Shuhuai Ren, Sirui Deng, Tao Guo
Abstract

We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, using a 128-token sliding window and a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), using a native 32k context length that is subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm, in which domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense, token-level rewards, enabling the student model to fully inherit teacher expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, repurposing MTP as a draft model for speculative decoding yields an acceptance length of up to 3.6 tokens and a 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.
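A minimal sketch of the hybrid attention layout described above, not the official implementation: it assumes the 5:1 ratio means five SWA layers per global layer, and the layer count and helper names are hypothetical, while the ratio and the 128-token window come from the abstract.

```python
import torch

SWA_PER_GLOBAL = 5   # 5:1 hybrid ratio from the abstract (assumed 5 SWA : 1 global)
WINDOW = 128         # sliding-window size from the abstract

def layer_schedule(num_layers: int) -> list:
    """Interleave attention types: every sixth layer uses global attention."""
    return ["global" if (i + 1) % (SWA_PER_GLOBAL + 1) == 0 else "swa"
            for i in range(num_layers)]

def sliding_window_mask(seq_len: int, window: int = WINDOW) -> torch.Tensor:
    """Causal mask where each query attends only to keys in the last `window` positions (itself included)."""
    idx = torch.arange(seq_len)
    rel = idx[:, None] - idx[None, :]      # query position minus key position
    return (rel >= 0) & (rel < window)     # True = attention allowed

print(layer_schedule(12))                  # five 'swa' layers, then 'global', repeated
print(sliding_window_mask(6, window=3).int())
```

As a back-of-the-envelope check on the speculative-decoding numbers, the snippet below shows how an acceptance length of about 3.6 tokens could translate into roughly a 2.6x decoding speedup; the per-step overhead factor is a hypothetical assumption, not a number from the report.

```python
def speculative_speedup(acceptance_len: float, step_overhead: float) -> float:
    """Tokens emitted per verification step divided by the relative cost of that step."""
    return acceptance_len / step_overhead

# Assuming each step (3 MTP draft layers + one target-model verify pass) costs
# about 1.4x a plain decoding step, 3.6 accepted tokens per step gives ~2.6x.
print(round(speculative_speedup(3.6, 1.4), 2))  # ~2.57
```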
