RoboMIND 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence

31 December 2025
Chengkai Hou
Kun Wu
Jiaming Liu
Zhengping Che
Di Wu
Fei Liao
Guangrun Li
Jingyang He
Qiuxuan Feng
Zhao Jin
Chenyang Gu
Zhuoyang Liu
Nuowei Han
Xiangju Mi
Yaoxu Lv
Yankai Fu
Gaole Dai
Langzhe Gu
Tao Li
Yuheng Zhang
Yixue Zhang
Xinhua Wang
Shichao Fan
Meng Li
Zhen Zhao
Ning Liu
Zhiyuan Xu
Pei Ren
Junjie Ji
Haonan Liu
Kuan Cheng
Shanghang Zhang
Jian Tang
63 pages, 15 figures, 9 tables
Abstract

While data-driven imitation learning has revolutionized robotic manipulation, current approaches remain constrained by the scarcity of large-scale, diverse real-world demonstrations. Consequently, the ability of existing models to generalize across long-horizon bimanual tasks and mobile manipulation in unstructured environments remains limited. To bridge this gap, we present RoboMIND 2.0, a comprehensive real-world dataset comprising over 310K dual-arm manipulation trajectories collected across six distinct robot embodiments and 739 complex tasks. Crucially, to support research on contact-rich and spatially extended tasks, the dataset incorporates 12K tactile-enhanced episodes and 20K mobile manipulation trajectories. Complementing this physical data, we construct high-fidelity digital twins of our real-world environments and release an additional 20K-trajectory simulated dataset to facilitate robust sim-to-real transfer. To fully exploit the potential of RoboMIND 2.0, we propose the MIND-2 system, a hierarchical dual-system framework optimized via offline reinforcement learning. MIND-2 couples a high-level semantic planner (MIND-2-VLM), which decomposes abstract natural-language instructions into grounded subgoals, with a low-level Vision-Language-Action executor (MIND-2-VLA) that generates precise, proprioception-aware motor actions.

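The abstract characterizes MIND-2 as a hierarchical dual system: a high-level VLM planner that decomposes an abstract instruction into grounded subgoals, and a low-level VLA executor that turns each subgoal plus the current observation (including proprioception) into motor actions. The Python sketch below illustrates one plausible shape of such a planner/executor control loop; every class name, method signature, and the fixed subgoal decomposition is a hypothetical placeholder, not the authors' actual API or model behavior.

# Minimal sketch of a hierarchical dual-system control loop in the spirit of
# the MIND-2 description above. All names and signatures are hypothetical.
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Observation:
    images: Sequence[object]                 # camera frames
    proprioception: Sequence[float]          # joint angles, gripper state, base pose
    tactile: Sequence[float] = field(default_factory=list)  # optional tactile data


class SemanticPlanner:
    """High-level planner (the VLM role): decomposes an abstract
    natural-language instruction into grounded subgoals."""

    def plan(self, instruction: str, obs: Observation) -> List[str]:
        # A real system would query a vision-language model here; this fixed
        # decomposition is purely illustrative.
        return ["approach the table",
                "grasp the cup with the right arm",
                "place the cup on the tray"]


class VLAExecutor:
    """Low-level executor (the VLA role): maps a subgoal plus the current
    observation, including proprioception, to a chunk of motor actions."""

    def act(self, subgoal: str, obs: Observation) -> List[List[float]]:
        # Placeholder output: one 14-DoF dual-arm action step of zeros.
        return [[0.0] * 14]


def run_episode(instruction: str,
                get_obs: Callable[[], Observation],
                send_action: Callable[[List[float]], None],
                planner: SemanticPlanner,
                executor: VLAExecutor,
                steps_per_subgoal: int = 50) -> None:
    """Hierarchical loop: plan once at the semantic level, then let the
    low-level policy execute each subgoal for a fixed step budget."""
    obs = get_obs()
    for subgoal in planner.plan(instruction, obs):
        for _ in range(steps_per_subgoal):
            for action in executor.act(subgoal, obs):
                send_action(action)
            obs = get_obs()


if __name__ == "__main__":
    # Dummy wiring for illustration: no robot or simulator is attached.
    dummy_obs = Observation(images=[], proprioception=[0.0] * 14)
    run_episode("put the cup on the tray",
                get_obs=lambda: dummy_obs,
                send_action=lambda a: None,
                planner=SemanticPlanner(),
                executor=VLAExecutor())

Splitting the system this way keeps slow, language-level reasoning out of the high-rate motor control loop, which is the usual motivation for dual-system planner/executor designs.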