R+X: Retrieval and Execution from Everyday Human Videos

17 July 2024
Georgios Papagiannis
Norman Di Palo
Pietro Vitiello
Edward Johns
Abstract

We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method (KAT) on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that R+X succeeds at translating unlabelled human videos into robust robot skills, and that R+X outperforms several recent alternative methods. Videos and code are available at this https URL.
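The abstract describes a two-stage pipeline: language-conditioned retrieval of relevant clips, followed by in-context execution conditioned on those clips. The sketch below is purely illustrative and not the paper's implementation: the VLM retrieval is stood in for by keyword overlap, and the in-context policy (KAT in the paper) by a nearest-neighbour action lookup; all names and data are hypothetical.

```python
# Hypothetical sketch of a retrieve-then-execute pipeline in the spirit
# of R+X. The VLM and KAT components are replaced by trivial stand-ins.
from dataclasses import dataclass, field

@dataclass
class Clip:
    description: str                       # what the human does in the clip
    demonstrations: list = field(default_factory=list)  # (observation, action) pairs

def retrieve(command: str, clips: list) -> list:
    """Retrieval step: rank clips by relevance to the language command.
    (R+X uses a Vision Language Model; word overlap stands in here.)"""
    words = set(command.lower().split())
    scored = [(len(words & set(c.description.lower().split())), c) for c in clips]
    scored.sort(key=lambda s: -s[0])
    return [c for score, c in scored if score > 0]

def execute(command: str, clips: list, observation: tuple):
    """Execution step: condition a policy on the retrieved demonstrations
    and act immediately, with no training phase. Nearest-neighbour lookup
    stands in for the in-context imitation learner (KAT)."""
    demos = [d for c in retrieve(command, clips) for d in c.demonstrations]
    if not demos:
        return None
    # Choose the action whose demo observation is closest to the current one.
    _, action = min(
        demos,
        key=lambda d: sum((a - b) ** 2 for a, b in zip(d[0], observation)),
    )
    return action

clips = [
    Clip("human opens the drawer", [((0.0, 0.0), "pull_handle")]),
    Clip("human wipes the table", [((1.0, 1.0), "wipe")]),
]
print(execute("open the drawer", clips, (0.1, 0.0)))  # -> pull_handle
```

The key property the sketch mirrors is that no gradient update occurs between receiving the command and acting: the retrieved demonstrations condition the policy directly.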

@article{papagiannis2025_2407.12957,
  title={R+X: Retrieval and Execution from Everyday Human Videos},
  author={Georgios Papagiannis and Norman Di Palo and Pietro Vitiello and Edward Johns},
  journal={arXiv preprint arXiv:2407.12957},
  year={2025}
}