
LLaDA2.0: Scaling Up Diffusion Language Models to 100B

Tiwei Bie
Maosong Cao
Kun Chen
Lun Du
Mingliang Gong
Zhuochen Gong
Yanmei Gu
Jiaqi Hu
Zenan Huang
Zhenzhong Lan
Chengxi Li
Chongxuan Li
Jianguo Li
Zehuan Li
Huabin Liu
Lin Liu
Guoshan Lu
Xiaocheng Lu
Yuxin Ma
Jianfeng Tan
Lanning Wei
Ji-Rong Wen
Yipeng Xing
Xiaolu Zhang
Junbo Zhao
Da Zheng
Jun Zhou
Junlin Zhou
Zhanchao Zhou
Liwang Zhu
Yihong Zhuang
Main: 14 pages · 7 figures · Bibliography: 5 pages · 2 tables
Abstract

This paper presents LLaDA2.0, a family of discrete diffusion large language models (dLLMs) scaled up to 100B total parameters through systematic conversion from auto-regressive (AR) models, establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds the principles of knowledge inheritance, progressive adaptation, and efficiency-aware design, and seamlessly converts a pre-trained AR model into a dLLM with a novel three-phase block-level WSD-based training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to a compact block size for block diffusion (decay). Together with post-training alignment via SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models are open-sourced.
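A minimal sketch of what the three-phase (warm-up / stable / decay) block-size schedule described above could look like. The phase boundaries, block sizes, and sequence length below are illustrative assumptions for exposition, not values reported by the paper.

```python
def block_size_schedule(step: int, total_steps: int, seq_len: int = 4096) -> int:
    """Return an illustrative diffusion block size for a given training step.

    Assumed phase split (not from the paper): first 10% warm-up,
    next 80% stable full-sequence diffusion, final 10% decay.
    """
    warmup_end = int(0.1 * total_steps)   # end of warm-up phase
    stable_end = int(0.9 * total_steps)   # end of stable (full-sequence) phase

    if step < warmup_end:
        # Warm-up: progressively grow the block size from a small value
        # toward the full sequence length.
        progress = step / max(warmup_end, 1)
        return max(4, int(progress * seq_len))
    elif step < stable_end:
        # Stable: large-scale full-sequence diffusion (block spans the sequence).
        return seq_len
    else:
        # Decay: revert to a compact block size for efficient parallel decoding.
        return 32


if __name__ == "__main__":
    # Sample the schedule at a few points of a hypothetical 100k-step run.
    for s in (0, 5_000, 50_000, 95_000):
        print(s, block_size_schedule(s, total_steps=100_000))
```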
