
Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer

Robert Moreno
Michael Neunert
Francesco Nori
Joy Ortiz
Kenneth Oslund
Carolina Parada
Emilio Parisotto
Amaris Paryag
Acorn Pooley
Thomas Power
Alessio Quaglino
Haroon Qureshi
Rajkumar Vasudeva Raju
Helen Ran
Dushyant Rao
Kanishka Rao
Isaac Reid
David Rendleman
Krista Reymann
Miguel Rivas
Francesco Romano
Yulia Rubanova
Peter Pastor Sampedro
Pannag R Sanketi
Dhruv Shah
Mohit Sharma
Kathryn Shea
Mohit Shridhar
Charles Shu
Vikas Sindhwani
Sumeet Singh
Radu Soricut
Rachel Sterneck
Ian Storz
Razvan Surdulescu
Jie Tan
Jonathan Tompson
Saran Tunyasuvunakool
Jake Varley
Grace Vesom
Giulia Vezzani
Maria Bauza Villalonga
Oriol Vinyals
René Wagner
Ayzaan Wahid
Stefan Welker
Paul Wohlhart
Chengda Wu
Markus Wulfmeier
Fei Xia
Ted Xiao
Annie Xie
Jinyu Xie
Peng Xu
Sichun Xu
Ying Xu
Zhuo Xu
Jimmy Yan
Sherry Yang
Skye Yang
Yuxiang Yang
Hiu Hong Yu
Wenhao Yu
Wentao Yuan
Yuan Yuan
Jingwei Zhang
Tingnan Zhang
Zhiyuan Zhang
Allan Zhou
Guangyao Zhou
Yuxiang Zhou
et al. (71 additional authors not shown)
Main: 22 Pages
44 Figures
Bibliography: 4 Pages
25 Tables
Appendix: 36 Pages
Abstract

General-purpose robots need a deep understanding of the physical world, advanced reasoning, and general and dexterous control. This report introduces the latest generation of the Gemini Robotics model family: Gemini Robotics 1.5, a multi-embodiment Vision-Language-Action (VLA) model, and Gemini Robotics-ER 1.5, a state-of-the-art Embodied Reasoning (ER) model. We are bringing together three major innovations. First, Gemini Robotics 1.5 features a novel architecture and a Motion Transfer (MT) mechanism, which enables it to learn from heterogeneous, multi-embodiment robot data and makes the VLA more general. Second, Gemini Robotics 1.5 interleaves actions with a multi-level internal reasoning process in natural language. This enables the robot to "think before acting", notably improves its ability to decompose and execute complex, multi-step tasks, and makes the robot's behavior more interpretable to the user. Third, Gemini Robotics-ER 1.5 establishes a new state of the art for embodied reasoning, i.e., for reasoning capabilities that are critical for robots, such as visual and spatial understanding, task planning, and progress estimation. Together, this family of models takes us a step towards an era of physical agents, enabling robots to perceive, think, and then act so they can solve complex multi-step tasks.
