Human2Robot: Learning Robot Actions from Paired Human-Robot Videos

Abstract

Distilling knowledge from human demonstrations is a promising way for robots to learn and act. However, existing work often overlooks the differences between humans and robots, producing unsatisfactory results. In this paper, we study how perfectly aligned human-robot pairs benefit robot learning. Capitalizing on VR-based teleoperation, we introduce H&R, a third-person dataset of 2,600 episodes, each of which captures the fine-grained correspondence between the human hand and the robot gripper. Inspired by the recent success of diffusion models, we introduce Human2Robot, an end-to-end diffusion framework that formulates learning from human demonstrations as a generative task. Human2Robot fully exploits the temporal dynamics in human videos to generate robot videos and predict actions at the same time. Through comprehensive evaluations on 4 carefully selected real-world tasks, we demonstrate that Human2Robot not only generates high-quality robot videos but also excels at seen tasks and generalizes to different positions, unseen appearances, novel instances, and even new backgrounds and task types.
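To make the abstract's architecture concrete, below is a minimal sketch (not the authors' code) of a conditional video-diffusion denoiser in the spirit described: given noisy robot-video frames, a diffusion timestep, and the paired human video, it predicts the noise and a short action sequence. All layer sizes, tensor shapes, and the 7-dimensional action space are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Human2RobotDenoiser(nn.Module):
    """Assumed sketch: human video conditions denoising of robot video + action prediction."""
    def __init__(self, channels=3, hidden=64, action_dim=7, horizon=16):
        super().__init__()
        # Encode the conditioning human video and the noisy robot video.
        self.human_enc = nn.Conv3d(channels, hidden, kernel_size=3, padding=1)
        self.robot_enc = nn.Conv3d(channels, hidden, kernel_size=3, padding=1)
        self.time_emb = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        # Fused backbone; a real model would use a video U-Net or transformer.
        self.backbone = nn.Sequential(
            nn.Conv3d(2 * hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv3d(hidden, hidden, 3, padding=1), nn.SiLU(),
        )
        self.noise_head = nn.Conv3d(hidden, channels, 3, padding=1)  # predicts epsilon
        self.action_head = nn.Sequential(                            # predicts an action chunk
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(hidden, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, noisy_robot_video, human_video, t):
        # noisy_robot_video, human_video: (B, C, T, H, W); t: (B,) diffusion steps
        h = self.human_enc(human_video)
        r = self.robot_enc(noisy_robot_video)
        temb = self.time_emb(t.float().unsqueeze(-1) / 1000.0)
        x = self.backbone(torch.cat([h, r], dim=1)) + temb[:, :, None, None, None]
        eps_pred = self.noise_head(x)
        actions = self.action_head(x).view(-1, self.horizon, self.action_dim)
        return eps_pred, actions

# One illustrative training step: epsilon-prediction loss plus an action loss.
model = Human2RobotDenoiser()
robot_video = torch.randn(2, 3, 16, 64, 64)
human_video = torch.randn(2, 3, 16, 64, 64)
actions_gt = torch.randn(2, 16, 7)
t = torch.randint(0, 1000, (2,))
noise = torch.randn_like(robot_video)
alpha = 0.9  # placeholder value from an assumed noise schedule
noisy = alpha ** 0.5 * robot_video + (1 - alpha) ** 0.5 * noise
eps_pred, actions_pred = model(noisy, human_video, t)
loss = nn.functional.mse_loss(eps_pred, noise) + nn.functional.mse_loss(actions_pred, actions_gt)
```

The key design point this sketch illustrates is that the robot-video generation branch and the action head share a backbone conditioned on the paired human video, so video generation and action prediction are learned jointly.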

@article{xie2025_2502.16587,
  title={Human2Robot: Learning Robot Actions from Paired Human-Robot Videos},
  author={Sicheng Xie and Haidong Cao and Zejia Weng and Zhen Xing and Shiwei Shen and Jiaqi Leng and Xipeng Qiu and Yanwei Fu and Zuxuan Wu and Yu-Gang Jiang},
  journal={arXiv preprint arXiv:2502.16587},
  year={2025}
}