
LLaDA2.1: Speeding Up Text Diffusion via Token Editing

Tiwei Bie
Maosong Cao
Xiang Cao
Bingsen Chen
Fuyuan Chen
Kun Chen
Lun Du
Daozhuo Feng
Haibo Feng
Mingliang Gong
Zhuocheng Gong
Yanmei Gu
Jian Guan
Kaiyuan Guan
Hongliang He
Zenan Huang
Juyong Jiang
Zhonghui Jiang
Zhenzhong Lan
Chengxi Li
Jianguo Li
Zehuan Li
Huabin Liu
Lin Liu
Guoshan Lu
Yuan Lu
Yuxin Ma
Xingyu Mou
Zhenxuan Pan
Kaida Qiu
Yuji Ren
Jianfeng Tan
Yiding Tian
Zian Wang
Lanning Wei
Tao Wu
Yipeng Xing
Wentao Ye
Liangyu Zha
Tianze Zhang
Xiaolu Zhang
Junbo Zhao
Da Zheng
Hao Zhong
Wanli Zhong
Jun Zhou
Junlin Zhou
Liwang Zhu
Muzhi Zhu
Yihong Zhuang
Main: 8 pages, 4 figures, 4 tables; Bibliography: 3 pages
Abstract

While LLaDA2.0 demonstrated the scaling potential of 100B-level block-diffusion models and their inherent parallelism, the trade-off between decoding speed and generation quality has remained difficult to balance. We present LLaDA2.1, which is designed to relax this trade-off. By integrating Token-to-Token (T2T) editing into the conventional Mask-to-Token (M2T) scheme, we introduce a joint, configurable threshold-decoding scheme. This design yields two operating modes: Speedy Mode (S Mode), which aggressively lowers the M2T threshold and relies on T2T editing to refine the resulting output, and Quality Mode (Q Mode), which uses conservative thresholds to secure stronger benchmark performance at a modest cost in efficiency. Building on an extended context window, we further implement the first large-scale Reinforcement Learning (RL) framework tailored to dLLMs, supported by dedicated techniques for stable gradient estimation. This alignment improves both reasoning accuracy and instruction-following fidelity, narrowing the gap between diffusion decoding dynamics and complex human intent. We release LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B). Across 33 benchmarks, LLaDA2.1 delivers strong task performance together with fast decoding. Despite its 100B scale, on coding tasks it reaches 892 TPS on HumanEval+, 801 TPS on BigCodeBench, and 663 TPS on LiveCodeBench.
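The abstract describes the joint threshold-decoding scheme only at a high level. The sketch below is a minimal illustration, not the released implementation: the MASK sentinel, the decode_step signature, and the specific threshold values are assumptions made for exposition. It shows the core idea of combining an M2T confidence threshold for committing masked positions with a T2T threshold for editing already-committed tokens, and how S Mode and Q Mode differ only in how those thresholds are set.

```python
# Minimal sketch of joint threshold decoding with M2T fill-in and T2T editing.
# Names, thresholds, and the single-step structure are illustrative assumptions.
import numpy as np

MASK = -1  # hypothetical sentinel id for still-masked positions

def decode_step(seq, probs, tau_m2t, tau_t2t):
    """One refinement step over a block.

    seq     : (L,)   current token ids, MASK where undecided
    probs   : (L, V) per-position model probabilities for this step
    tau_m2t : confidence needed to commit a masked position (Mask-to-Token)
    tau_t2t : confidence needed to overwrite a committed token (Token-to-Token)
    """
    out = seq.copy()
    best = probs.argmax(axis=-1)   # most likely token at each position
    conf = probs.max(axis=-1)      # its probability
    masked = seq == MASK

    # M2T: fill masked positions whose confidence clears the threshold.
    fill = masked & (conf >= tau_m2t)
    out[fill] = best[fill]

    # T2T: revisit committed tokens and edit them when the model is
    # sufficiently confident about a different token.
    edit = (~masked) & (best != seq) & (conf >= tau_t2t)
    out[edit] = best[edit]
    return out

# Toy usage: S Mode commits aggressively (low tau_m2t) and leans on T2T edits
# in later steps; Q Mode commits conservatively for fewer, higher-confidence fills.
rng = np.random.default_rng(0)
L, V = 8, 32
probs = rng.dirichlet(np.ones(V) * 0.1, size=L)
seq = np.full(L, MASK)
s_mode = decode_step(seq, probs, tau_m2t=0.3, tau_t2t=0.9)
q_mode = decode_step(seq, probs, tau_m2t=0.8, tau_t2t=0.9)
print("S Mode commits:", (s_mode != MASK).sum(), "| Q Mode commits:", (q_mode != MASK).sum())
```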
