DM0: An Embodied-Native Vision-Language-Action Model towards Physical AI

En Yu
Haoran Lv
Jianjian Sun
Kangheng Lin
Ruitao Zhang
Yukang Shi
Yuyang Chen
Ze Chen
Ziheng Zhang
Fan Jia
Kaixin Liu
Meng Zhang
Ruitao Hao
Saike Huang
Songhan Xie
Yu Liu
Zhao Wu
Bin Xie
Pengwei Zhang
Qi Yang
Xianchi Deng
Yunfei Wei
Enwen Zhang
Hongyang Peng
Jie Zhao
Kai Liu
Wei Sun
Yajun Wei
Yi Yang
Yunqiao Zhang
Ziwei Yan
Haitao Yang
Hao Liu
Haoqiang Fan
Haowei Zhang
Junwen Huang
Yang Chen
Yunchao Ma
Yunhuan Yang
Zhengyuan Du
Ziming Liu
Jiahui Niu
Yucheng Zhao
Daxin Jiang
Wenbin Tang
Xiangyu Zhang
Zheng Ge
Erjin Zhou
Tiancai Wang
Main: 14 Pages
3 Figures
Bibliography: 5 Pages
1 Table
Appendix: 5 Pages
Abstract

Moving beyond the traditional paradigm of adapting internet-pretrained models to physical tasks, we present DM0, an Embodied-Native Vision-Language-Action (VLA) framework designed for Physical AI. Unlike approaches that treat physical grounding as a fine-tuning afterthought, DM0 unifies embodied manipulation and navigation by learning from heterogeneous data sources from the outset. Our methodology follows a three-stage pipeline: Pretraining, Mid-Training, and Post-Training. First, we conduct large-scale unified pretraining of the Vision-Language Model (VLM) on diverse corpora, integrating web text, autonomous driving scenarios, and embodied interaction logs, to jointly acquire semantic knowledge and physical priors. Subsequently, we build a flow-matching action expert atop the VLM. To reconcile high-level reasoning with low-level control, DM0 employs a hybrid training strategy: for embodied data, gradients from the action expert are not backpropagated to the VLM, which preserves its generalized representations, while the VLM remains trainable on non-embodied data. Furthermore, we introduce an Embodied Spatial Scaffolding strategy to construct spatial Chain-of-Thought (CoT) reasoning, which constrains the action solution space. Experiments on the RoboChallenge benchmark demonstrate that DM0 achieves state-of-the-art performance in both Specialist and Generalist settings on Table30.
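
The hybrid training strategy can be illustrated with a toy sketch. This is not the DM0 implementation: each module is collapsed to a single scalar weight, gradients are derived by hand, and every name (`Params`, `gradients`, `sgd_step`) is hypothetical. The sketch also assumes that embodied samples contribute only the action-expert loss; the abstract does not say whether the VLM receives any other loss on embodied data.

```python
from dataclasses import dataclass


@dataclass
class Params:
    w_vlm: float     # hypothetical stand-in for all VLM weights
    w_action: float  # hypothetical stand-in for all action-expert weights


def gradients(p: Params, x: float, target: float, embodied: bool):
    """Return (grad_w_vlm, grad_w_action) for one sample."""
    h = p.w_vlm * x  # VLM feature consumed by the action expert
    if embodied:
        # Action loss (w_action * h - target)^2, with h treated as a
        # constant (stop-gradient), so nothing flows back into the VLM.
        err = p.w_action * h - target
        return 0.0, 2.0 * err * h
    # Non-embodied sample: loss (h - target)^2 updates the VLM only.
    err = h - target
    return 2.0 * err * x, 0.0


def sgd_step(p: Params, x: float, target: float, embodied: bool,
             lr: float = 0.1) -> Params:
    """Apply one plain SGD update using the routed gradients."""
    g_vlm, g_action = gradients(p, x, target, embodied)
    return Params(p.w_vlm - lr * g_vlm, p.w_action - lr * g_action)
```

With this routing, a step on an embodied sample leaves `w_vlm` unchanged and moves only `w_action`, while a step on a non-embodied sample does the reverse. In an autodiff framework the same effect is usually obtained by detaching the VLM features before they enter the action expert.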
