DM0: An Embodied-Native Vision-Language-Action Model towards Physical AI

16 February 2026

En Yu

Haoran Lv

Jianjian Sun

Kangheng Lin

Ruitao Zhang

Yukang Shi

Yuyang Chen

Ze Chen

Ziheng Zhang

Fan Jia

Kaixin Liu

Meng Zhang

Ruitao Hao

Saike Huang

Songhan Xie

Yu Liu

Zhao Wu

Bin Xie

Pengwei Zhang

Qi Yang

Xianchi Deng

Yunfei Wei

Enwen Zhang

Hongyang Peng

Jie Zhao

Kai Liu

Wei Sun

Yajun Wei

Yi Yang

Yunqiao Zhang

Ziwei Yan

Haitao Yang

Hao Liu

Haoqiang Fan

Haowei Zhang

Junwen Huang

Yang Chen

Yunchao Ma

Yunhuan Yang

Zhengyuan Du

Ziming Liu

Jiahui Niu

Yucheng Zhao

Daxin Jiang

Wenbin Tang

Xiangyu Zhang

Zheng Ge

Erjin Zhou

Tiancai Wang



Main: 14 pages; Bibliography: 5 pages; Appendix: 5 pages; 3 figures; 1 table

Abstract

Moving beyond the traditional paradigm of adapting internet-pretrained models to physical tasks, we present DM0, an Embodied-Native Vision-Language-Action (VLA) framework designed for Physical AI. Unlike approaches that treat physical grounding as a fine-tuning afterthought, DM0 unifies embodied manipulation and navigation by learning from heterogeneous data sources from the outset. Our methodology follows a three-stage pipeline: Pretraining, Mid-Training, and Post-Training. First, we conduct large-scale unified pretraining of the Vision-Language Model (VLM) on diverse corpora, integrating web text, autonomous driving scenarios, and embodied interaction logs, to jointly acquire semantic knowledge and physical priors. Subsequently, we build a flow-matching action expert atop the VLM. To reconcile high-level reasoning with low-level control, DM0 employs a hybrid training strategy: for embodied data, gradients from the action expert are not backpropagated to the VLM, which preserves its generalized representations, while the VLM remains trainable on non-embodied data. Furthermore, we introduce an Embodied Spatial Scaffolding strategy to construct spatial Chain-of-Thought (CoT) reasoning, effectively constraining the action solution space. Experiments on the RoboChallenge benchmark demonstrate that DM0 achieves state-of-the-art performance in both Specialist and Generalist settings on Table30.
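
The hybrid training strategy can be illustrated with a toy scalar sketch. This is not the DM0 implementation: the one-weight "VLM" and "action expert", the squared-error stand-in for the VLM loss, the linear-path flow-matching convention, and every name below (`flow_matching_pair`, `hybrid_grads`, `w_vlm`, `w_act`) are assumptions made purely for illustration. It shows only the gradient routing the abstract describes: on embodied samples the action loss updates the action expert but not the VLM, while non-embodied samples still update the VLM.

```python
# Toy sketch of hybrid gradient routing. All names and the scalar model are
# hypothetical; nothing here is taken from the DM0 codebase.

def flow_matching_pair(noise, action, t):
    """Assumed linear-path flow matching: return the point on the straight
    path from noise to action at time t, and the target velocity."""
    x_t = (1.0 - t) * noise + t * action
    target_v = action - noise
    return x_t, target_v


def hybrid_grads(w_vlm, w_act, sample):
    """Return hand-derived gradients of a squared-error loss for one sample.

    feature = w_vlm * obs              (stand-in for the VLM)
    pred_v  = w_act * (feature + x_t)  (stand-in for the action expert)
    """
    feature = w_vlm * sample["obs"]
    if sample["embodied"]:
        x_t, target_v = flow_matching_pair(
            sample["noise"], sample["action"], sample["t"]
        )
        # The feature is treated as a constant (stop-gradient), so the
        # action loss contributes nothing to the VLM weight.
        expert_input = feature + x_t
        err = w_act * expert_input - target_v
        return {"w_vlm": 0.0, "w_act": 2.0 * err * expert_input}
    # Non-embodied sample: a stand-in VLM loss keeps the VLM trainable,
    # and the action expert is not involved.
    err = feature - sample["label"]
    return {"w_vlm": 2.0 * err * sample["obs"], "w_act": 0.0}


embodied = {"embodied": True, "obs": 1.0, "noise": 0.0, "action": 1.0, "t": 0.5}
non_embodied = {"embodied": False, "obs": 2.0, "label": 0.0}

g_emb = hybrid_grads(0.5, 2.0, embodied)
g_txt = hybrid_grads(0.5, 2.0, non_embodied)
print(g_emb)  # VLM gradient is zero on the embodied sample
print(g_txt)  # VLM gradient is non-zero on the non-embodied sample
```

In an autodiff framework the same routing would typically be expressed with a stop-gradient or detach operation on the VLM features before they enter the action expert, applied only to embodied batches.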

