Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

I. Apanasevich
M. Artemyev
R. Babakyan
P. Fedotova
D. Grankin
E. Kupryashin
A. Misailidi
D. Nerus
A. Nutalapati
G. Sidorov
I. Efremov
M. Gerasyov
D. Pikurov
Y. Senchenko
S. Davidenko
D. Kulikov
M. Sultankin
K. Askarbek
O. Shamanin
D. Statovoy
E. Zalyaev
I. Zorin
A. Letkin
E. Rusakov
A. Silchenko
V. Vorobyov
S. Sobolnikov
A. Postnikov
Main: 19 pages, 15 figures; bibliography: 3 pages; 4 tables
Abstract

We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) policy alignment. We couple a scalable data-processing pipeline (3,000 hours of demonstrations) with temporal alignment and quality filtering, and use a unified, embodiment-aware action interface that enables a single policy to control humanoids, mobile manipulators, and fixed-base arms. At inference, the VLA controller is augmented with episode-progress prediction, out-of-distribution detection, and joint-prediction-based guidance to improve safety and target-selection precision. Experiments on Simpler BRIDGE WidowX and CALVIN ABC-D, as well as real-robot evaluations, show strong generalization and performance gains from RL alignment in success rate, robustness, and long-horizon efficiency.
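To make the "unified, embodiment-aware action interface" idea concrete, below is a minimal illustrative sketch. It is not the paper's implementation; all names (EmbodimentSpec, UnifiedActionInterface, the 64-dimensional unified vector, the example embodiments and their DoF counts) are assumptions introduced here purely to show how a single fixed-size policy output could be routed to robots with different action spaces.

```python
# Illustrative sketch only: the paper does not publish this interface, so every
# name and dimension below is hypothetical.
from dataclasses import dataclass
import numpy as np


@dataclass
class EmbodimentSpec:
    """Describes one robot body the shared policy can drive."""
    name: str
    action_dim: int    # number of controllable DoF for this embodiment
    slice_start: int   # where its DoF live inside the unified action vector


class UnifiedActionInterface:
    """Maps a fixed-size, embodiment-agnostic action vector to per-robot commands.

    Unused dimensions are masked out, so one policy head can emit actions for
    humanoids, mobile manipulators, and fixed-base arms alike.
    """

    def __init__(self, specs: list[EmbodimentSpec], unified_dim: int = 64):
        self.specs = {s.name: s for s in specs}
        self.unified_dim = unified_dim

    def to_robot(self, unified_action: np.ndarray, embodiment: str) -> np.ndarray:
        # Slice out only the dimensions this embodiment actually uses.
        spec = self.specs[embodiment]
        return unified_action[spec.slice_start:spec.slice_start + spec.action_dim]

    def to_unified(self, robot_action: np.ndarray, embodiment: str) -> np.ndarray:
        # Embed a robot-specific action back into the padded unified vector.
        spec = self.specs[embodiment]
        out = np.zeros(self.unified_dim, dtype=np.float32)
        out[spec.slice_start:spec.slice_start + spec.action_dim] = robot_action
        return out


# Example: route one policy output to two different embodiments.
iface = UnifiedActionInterface([
    EmbodimentSpec("fixed_arm", action_dim=7, slice_start=0),
    EmbodimentSpec("humanoid", action_dim=32, slice_start=7),
])
policy_output = np.random.randn(64).astype(np.float32)
arm_cmd = iface.to_robot(policy_output, "fixed_arm")      # shape (7,)
humanoid_cmd = iface.to_robot(policy_output, "humanoid")  # shape (32,)
```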
