
LLaDA2.0: Scaling Up Diffusion Language Models to 100B

Tiwei Bie
Maosong Cao
Kun Chen
Lun Du
Mingliang Gong
Zhuochen Gong
Yanmei Gu
Jiaqi Hu
Zenan Huang
Zhenzhong Lan
Chengxi Li
Chongxuan Li
Jianguo Li
Zehuan Li
Huabin Liu
Lin Liu
Guoshan Lu
Xiaocheng Lu
Yuxin Ma
Jianfeng Tan
Lanning Wei
Ji-Rong Wen
Yipeng Xing
Xiaolu Zhang
Junbo Zhao
Da Zheng
Jun Zhou
Junlin Zhou
Zhanchao Zhou
Liwang Zhu
Yihong Zhuang
Main: 14 pages · 7 figures · Bibliography: 5 pages · 2 tables
Abstract

This paper presents LLaDA2.0, a family of discrete diffusion large language models (dLLMs) scaled up to 100B total parameters through systematic conversion from auto-regressive (AR) models, establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds the principles of knowledge inheritance, progressive adaptation, and efficiency-aware design, and seamlessly converts a pre-trained AR model into a dLLM with a novel three-phase block-level WSD-based training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to a compact block size for block diffusion (decay). Together with post-training alignment via SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models are open-sourced.
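A minimal sketch of what the three-phase (warm-up / stable / decay) block-size schedule described above could look like. The phase boundaries, block sizes, and sequence length below are illustrative assumptions for exposition, not values reported by the paper.

```python
def block_size_schedule(step: int, total_steps: int, seq_len: int = 4096) -> int:
    """Return an illustrative diffusion block size for a given training step.

    Assumed phase split (not from the paper): first 10% warm-up,
    next 80% stable full-sequence diffusion, final 10% decay.
    """
    warmup_end = int(0.1 * total_steps)   # end of warm-up phase
    stable_end = int(0.9 * total_steps)   # end of stable (full-sequence) phase

    if step < warmup_end:
        # Warm-up: progressively grow the block size from a small value
        # toward the full sequence length.
        progress = step / max(warmup_end, 1)
        return max(4, int(progress * seq_len))
    elif step < stable_end:
        # Stable: large-scale full-sequence diffusion (block spans the sequence).
        return seq_len
    else:
        # Decay: revert to a compact block size for efficient parallel decoding.
        return 32


if __name__ == "__main__":
    # Sample the schedule at a few points of a hypothetical 100k-step run.
    for s in (0, 5_000, 50_000, 95_000):
        print(s, block_size_schedule(s, total_steps=100_000))
```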
