Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution

Rui Cai
Jun Guo
Xinze He
Piaopiao Jin
Jie Li
Bingxuan Lin
Futeng Liu
Wei Liu
Fei Ma
Kun Ma
Feng Qiu
Heng Qu
Yifei Su
Qiao Sun
Dong Wang
Donghao Wang
Yunhong Wang
Rujie Wu
Diyun Xiang
Yu Yang
Hangjun Ye
Yuan Zhang
Quanyun Zhou
Main: 16 pages · Appendix: 6 pages · Bibliography: 1 page · 8 figures · 5 tables
Abstract

In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast, smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad and generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge in the underlying pre-trained VLM. During post-training, we propose several techniques for training the VLA model for asynchronous execution, addressing inference latency during real-robot rollouts. During deployment, we carefully align the timesteps of consecutive predicted action chunks to ensure continuous and seamless real-time rollouts. We evaluate Xiaomi-Robotics-0 extensively on simulation benchmarks and on two challenging real-robot tasks that require precise and dexterous bimanual manipulation. Results show that our method achieves state-of-the-art performance across all simulation benchmarks. Moreover, Xiaomi-Robotics-0 rolls out quickly and smoothly on real robots using a consumer-grade GPU, achieving high success rates and throughput on both real-robot tasks. To facilitate future research, code and model checkpoints are open-sourced at this https URL.
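The abstract describes the timestep alignment of consecutive action chunks only at a high level. The sketch below illustrates one way such alignment could work during asynchronous deployment: inference for the next chunk starts while the current chunk is still executing, and when the new chunk arrives it is spliced in with its already-elapsed leading actions dropped. All names (`predict_chunk`, `aligned_rollout`) and values (`CHUNK_LEN`, `LATENCY_STEPS`, `DOF`) are our own assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Assumed values for illustration; the paper does not specify these.
CHUNK_LEN = 50      # actions per predicted chunk
LATENCY_STEPS = 8   # control steps elapsed during one model inference
DOF = 14            # bimanual degrees of freedom (hypothetical)

def predict_chunk(obs_step: int) -> np.ndarray:
    """Stand-in for the VLA model: returns one action per future control
    step, stamped with the absolute timestep it is intended for."""
    steps = np.arange(obs_step, obs_step + CHUNK_LEN)
    return np.repeat(steps[:, None], DOF, axis=1).astype(float)

def aligned_rollout(total_steps: int) -> list[float]:
    executed = []
    chunk = predict_chunk(obs_step=0)   # first chunk, blocking
    cursor = 0                          # index into the current chunk
    t = 0                               # absolute control step
    pending_obs = None                  # step at which async inference began
    while t < total_steps:
        # Request the next prediction early enough that it arrives
        # before the current chunk runs out of actions.
        if pending_obs is None and len(chunk) - cursor == LATENCY_STEPS:
            pending_obs = t             # observation timestamp
        executed.append(chunk[cursor][0])
        cursor += 1
        t += 1
        # The async result "arrives" LATENCY_STEPS after it was requested.
        if pending_obs is not None and t - pending_obs == LATENCY_STEPS:
            new_chunk = predict_chunk(obs_step=pending_obs)
            # Alignment step: the new chunk was predicted from an
            # observation taken (t - pending_obs) steps ago, so its
            # leading actions cover timesteps already executed. Dropping
            # them makes the splice continue at the current step t.
            chunk, cursor = new_chunk[t - pending_obs:], 0
            pending_obs = None
    return executed

if __name__ == "__main__":
    actions = aligned_rollout(120)
    # Strictly consecutive timesteps, no repeats or gaps across
    # chunk boundaries => seamless handover between chunks.
    assert actions == list(map(float, range(120)))
    print("rollout is continuous across chunk boundaries")
```

Without the slicing step, each handover would replay or skip `LATENCY_STEPS` worth of actions, producing the jerky motion that the paper's deployment strategy is designed to avoid.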
