ERNIE 5.0 Technical Report

Haifeng Wang
Hua Wu
Tian Wu
Yu Sun
Jing Liu
Dianhai Yu
Yanjun Ma
Jingzhou He
Zhongjun He
Dou Hong
Qiwen Liu
Shuohuan Wang
Junyuan Shang
Zhenyu Zhang
Yuchen Ding
Jinle Zeng
Jiabin Yang
Liang Shen
Ruibiao Chen
Weichong Yin
Siyu Ding
Dai Dai
Shikun Feng
Siqi Bao
Bolei He
Yan Chen
Zhenyu Jiao
Ruiqing Zhang
Zeyu Chen
Qingqing Dang
Kaipeng Deng
Jiajun Jiang
Enlei Gong
Guoxia Wang
Yanlin Sha
Yi Liu
Yehan Zheng
Weijian Xu
Jiaxiang Liu
Zengfeng Zeng
Yingqi Qu
Zhongli Li
Zhengkun Zhang
Xiyang Wang
Zixiang Xu
Xinchao Xu
Zhengjie Huang
Dong Wang
Bingjin Chen
Yue Chang
Xing Yuan
Shiwei Huang
Qiao Zhao
Xinzhe Ding
Shuangshuang Qiao
Baoshan Yang
Bihong Tang
Bin Li
Bingquan Wang
Binhan Tang
Binxiong Zheng
Bo Cui
Bo Ke
Bo Zhang
Bowen Zhang
Boyan Zhang
Boyang Liu
Caiji Zhang
Can Li
Chang Xu
Chao Pang
Chao Zhang
Chaoyi Yuan
Chen Chen
Cheng Cui
Chenlin Yin
Chun Gan
Chunguang Chai
Chuyu Fang
Cuiyun Han
Dan Zhang
Danlei Feng
Danxiang Zhu
Dong Sun
Dongbo Li
Dongdong Li
Dongdong Liu
Dongxue Liu
Fan Ding
Fan Hu
Fan Li
Fan Mo
Feisheng Wu
Fengwei Liu
Gangqiang Hu
Gaofeng Lu
Gaopeng Yong
Gexiao Tian
Guan Wang
Guangchen Ni
Main: 28 pages, 10 figures, 14 tables; Bibliography: 8 pages
Abstract

In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, built on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, ensuring efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 is the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside a comprehensive empirical analysis of elastic training, aiming to offer useful insights to the community.
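To make the two central ideas in the abstract concrete, the sketch below shows what a modality-agnostic top-k MoE layer with elastic routing sparsity might look like. This is a minimal illustration under our own assumptions, not the paper's implementation: the class name, dimensions, and expert structure are hypothetical, and ERNIE 5.0's actual router, expert design, and elasticity mechanism are not specified at this level of detail in the abstract. The key points illustrated are that a single router scores all experts for every token regardless of modality, and that the number of active experts per token (top_k) can be reduced at inference to trade quality for latency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAgnosticMoE(nn.Module):
    """Illustrative sparse MoE layer (hypothetical, not ERNIE 5.0's code).

    One shared router scores every expert for every token, so text, image,
    video, and audio tokens compete for the same expert pool rather than
    being routed by modality-specific rules.
    """

    def __init__(self, d_model=64, n_experts=16, top_k=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x, top_k=None):
        # Elastic routing sparsity (assumed mechanism): passing a smaller
        # top_k at inference activates fewer experts per token without
        # retraining, trading quality for latency and memory.
        k = top_k or self.top_k
        logits = self.router(x)                   # (batch, seq, n_experts)
        weights, idx = logits.topk(k, dim=-1)     # k highest-scoring experts
        weights = F.softmax(weights, dim=-1)      # renormalize over chosen k
        out = torch.zeros_like(x)
        for slot in range(k):
            for e in range(len(self.experts)):
                mask = idx[..., slot] == e        # tokens routed to expert e
                if mask.any():
                    gate = weights[..., slot][mask].unsqueeze(-1)
                    out[mask] += gate * self.experts[e](x[mask])
        return out


layer = ModalityAgnosticMoE()
tokens = torch.randn(2, 8, 64)    # a mixed-modality token group (toy sizes)
dense = layer(tokens)             # full trained sparsity (top_k=4)
fast = layer(tokens, top_k=2)     # elastic sub-model: fewer active experts
```

In this reading, the "family of sub-models" from elastic training corresponds to the same weights evaluated at different sparsity (and, in the full paper's framing, different depths and expert capacities); the loop-based dispatch above is written for clarity rather than the batched expert execution a production system would use.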
