GigaBrain-0.5M*: A VLA That Learns From World Model-Based Reinforcement Learning

GigaBrain Team
Boyuan Wang
Chaojun Ni
Guan Huang
Guosheng Zhao
Hao Li
Jie Li
Jindi Lv
Jingyu Liu
Lv Feng
Mingming Yu
Peng Li
Qiuping Deng
Tianze Liu
Xinyu Zhou
Xinze Chen
Xiaofeng Wang
Yang Wang
Yifan Li
Yifei Nie
Yilong Li
Yukun Zhou
Yun Ye
Zhichao Liu
Zheng Zhu
Main: 14 pages
18 figures
Bibliography: 6 pages
1 table
Abstract

Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. We therefore propose GigaBrain-0.5M*, a VLA model trained via world model-based reinforcement learning. It is built upon GigaBrain-0.5, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark. GigaBrain-0.5M* further integrates world model-based reinforcement learning via RAMP (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that RAMP achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30% on challenging tasks including Laundry Folding, Box Packing, and Espresso Preparation. Critically, GigaBrain-0.5M* exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure, as validated by real-world deployment videos on our project page (this https URL).
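To make the RAMP recipe from the abstract concrete, the snippet below is a minimal, hypothetical PyTorch sketch of a world-model-conditioned policy trained with a REINFORCE-style update: a frozen stand-in world model predicts future latents, and the policy conditions its action chunk on them. All module names, tensor shapes, and the toy reward are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of world-model-conditioned policy RL (not the paper's code).
import torch
import torch.nn as nn

OBS_DIM, LATENT_DIM, ACT_DIM, CHUNK = 64, 32, 8, 16  # assumed sizes

class WorldModel(nn.Module):
    """Stand-in for a pre-trained video world model (kept frozen here)."""
    def __init__(self):
        super().__init__()
        self.predict = nn.GRU(OBS_DIM, LATENT_DIM, batch_first=True)

    @torch.no_grad()
    def rollout(self, obs, horizon):
        # Repeat the current observation as a crude placeholder for
        # autoregressive future prediction.
        seq = obs.unsqueeze(1).expand(-1, horizon, -1).contiguous()
        latents, _ = self.predict(seq)
        return latents  # (B, horizon, LATENT_DIM)

class WorldModelConditionedPolicy(nn.Module):
    """Policy that consumes the observation plus predicted future latents."""
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(OBS_DIM + LATENT_DIM, 128), nn.ReLU(),
            nn.Linear(128, CHUNK * ACT_DIM * 2),  # mean and log-std per action dim
        )

    def forward(self, obs, future_latents):
        # Condition on the observation and a pooled summary of predicted futures.
        ctx = torch.cat([obs, future_latents.mean(dim=1)], dim=-1)
        mean, log_std = self.head(ctx).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

world_model, policy = WorldModel(), WorldModelConditionedPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

obs = torch.randn(4, OBS_DIM)              # placeholder observation batch
futures = world_model.rollout(obs, CHUNK)  # frozen world-model prediction
dist = policy(obs, futures)
action_chunk = dist.sample()               # (B, CHUNK * ACT_DIM)
reward = -action_chunk.pow(2).mean(-1)     # toy reward; real tasks use task success
loss = -(dist.log_prob(action_chunk).sum(-1) * reward.detach()).mean()
optimizer.zero_grad(); loss.backward(); optimizer.step()
```

In this sketch the world model stays frozen and only the policy head receives gradients, which mirrors the abstract's framing of the world model as a foundation that enhances VLA learning rather than a component retrained per task.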
