ERNIE 5.0 Technical Report

Haifeng Wang
Hua Wu
Tian Wu
Yu Sun
Jing Liu
Dianhai Yu
Yanjun Ma
Jingzhou He
Zhongjun He
Dou Hong
Qiwen Liu
Shuohuan Wang
Junyuan Shang
Zhenyu Zhang
Yuchen Ding
Jinle Zeng
Jiabin Yang
Liang Shen
Ruibiao Chen
Weichong Yin
Siyu Ding
Dai Dai
Shikun Feng
Siqi Bao
Bolei He
Yan Chen
Zhenyu Jiao
Ruiqing Zhang
Zeyu Chen
Qingqing Dang
Kaipeng Deng
Jiajun Jiang
Enlei Gong
Guoxia Wang
Yanlin Sha
Yi Liu
Yehan Zheng
Weijian Xu
Jiaxiang Liu
Zengfeng Zeng
Yingqi Qu
Zhongli Li
Zhengkun Zhang
Xiyang Wang
Zixiang Xu
Xinchao Xu
Zhengjie Huang
Dong Wang
Bingjin Chen
Yue Chang
Xing Yuan
Shiwei Huang
Qiao Zhao
Xinzhe Ding
Shuangshuang Qiao
Baoshan Yang
Bihong Tang
Bin Li
Bingquan Wang
Binhan Tang
Binxiong Zheng
Bo Cui
Bo Ke
Bo Zhang
Bowen Zhang
Boyan Zhang
Boyang Liu
Caiji Zhang
Can Li
Chang Xu
Chao Pang
Chao Zhang
Chaoyi Yuan
Chen Chen
Cheng Cui
Chenlin Yin
Chun Gan
Chunguang Chai
Chuyu Fang
Cuiyun Han
Dan Zhang
Danlei Feng
Danxiang Zhu
Dong Sun
Dongbo Li
Dongdong Li
Dongdong Liu
Dongxue Liu
Fan Ding
Fan Hu
Fan Li
Fan Mo
Feisheng Wu
Fengwei Liu
Gangqiang Hu
Gaofeng Lu
Gaopeng Yong
Gexiao Tian
Guan Wang
Guangchen Ni
Main: 28 pages, 10 figures, 14 tables; Bibliography: 8 pages
Abstract

In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, built on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, ensuring efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 is the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside a comprehensive empirical analysis of elastic training, aiming to offer useful insights to the community.
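To make the two central ideas in the abstract concrete, the sketch below shows what a modality-agnostic top-k MoE layer with elastic routing sparsity might look like. This is a minimal illustration under our own assumptions, not the paper's implementation: the class name, dimensions, and expert structure are hypothetical, and ERNIE 5.0's actual router, expert design, and elasticity mechanism are not specified at this level of detail in the abstract. The key points illustrated are that a single router scores all experts for every token regardless of modality, and that the number of active experts per token (top_k) can be reduced at inference to trade quality for latency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityAgnosticMoE(nn.Module):
    """Illustrative sparse MoE layer (hypothetical, not ERNIE 5.0's code).

    One shared router scores every expert for every token, so text, image,
    video, and audio tokens compete for the same expert pool rather than
    being routed by modality-specific rules.
    """

    def __init__(self, d_model=64, n_experts=16, top_k=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x, top_k=None):
        # Elastic routing sparsity (assumed mechanism): passing a smaller
        # top_k at inference activates fewer experts per token without
        # retraining, trading quality for latency and memory.
        k = top_k or self.top_k
        logits = self.router(x)                   # (batch, seq, n_experts)
        weights, idx = logits.topk(k, dim=-1)     # k highest-scoring experts
        weights = F.softmax(weights, dim=-1)      # renormalize over chosen k
        out = torch.zeros_like(x)
        for slot in range(k):
            for e in range(len(self.experts)):
                mask = idx[..., slot] == e        # tokens routed to expert e
                if mask.any():
                    gate = weights[..., slot][mask].unsqueeze(-1)
                    out[mask] += gate * self.experts[e](x[mask])
        return out


layer = ModalityAgnosticMoE()
tokens = torch.randn(2, 8, 64)    # a mixed-modality token group (toy sizes)
dense = layer(tokens)             # full trained sparsity (top_k=4)
fast = layer(tokens, top_k=2)     # elastic sub-model: fewer active experts
```

In this reading, the "family of sub-models" from elastic training corresponds to the same weights evaluated at different sparsity (and, in the full paper's framing, different depths and expert capacities); the loop-based dispatch above is written for clarity rather than the batched expert execution a production system would use.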
