LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model

Zebin You
Xiaolu Zhang
Jun Zhou
Chongxuan Li
Ji-Rong Wen
Main: 8 pages, 5 figures, 10 tables; Bibliography: 4 pages; Appendix: 7 pages
Abstract

We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding from continuous diffusion for visual generation, while coupling the two through a shared, simple, and efficient attention backbone that reduces redundant computation over fixed conditions. Building on MoD, we further introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, supporting the effectiveness of unified omni diffusion modeling. Code is available at this https URL.
