From Single Images to Motion Policies via Video-Generation Environment Representations

25 May 2025
Weiming Zhi, Ziyong Ma, Tianyi Zhang, Matthew Johnson-Roberson
Main: 9 pages · 15 figures · 3 tables · Bibliography: 3 pages · Appendix: 1 page
Abstract

Autonomous robots typically need to construct representations of their surroundings and adapt their motions to the geometry of their environment. Here, we tackle the problem of constructing, from a single input RGB image, a policy model for collision-free motion generation that is consistent with the environment. Extracting 3D structure from a single image often involves monocular depth estimation, and developments in depth estimation have given rise to large pre-trained models such as DepthAnything. However, using the outputs of these models for downstream motion generation is challenging due to the frustum-shaped errors that arise. Instead, we propose a framework known as Video-Generation Environment Representation (VGER), which leverages advances in large-scale video generation models to generate a moving-camera video conditioned on the input image. Frames of this video, which form a multiview dataset, are then input into a pre-trained 3D foundation model to produce a dense point cloud. We then introduce a multi-scale noise approach to train an implicit representation of the environment structure and build a motion generation model that complies with the geometry of the representation. We extensively evaluate VGER over a diverse set of indoor and outdoor environments and demonstrate its ability to produce smooth motions that account for the captured geometry of a scene, all from a single RGB input image.
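
The abstract outlines a single-image-to-policy pipeline: generate a moving-camera video from the input image, reconstruct a dense point cloud from its frames, fit an implicit representation of the scene using multi-scale noise, and generate motions that respect that geometry. The Python sketch below only illustrates how these stages might fit together; the model wrappers, the choice of noise scales, the distance-style supervision, and the toy policy step are assumptions made for illustration, not the authors' implementation.

import numpy as np

def generate_orbit_video(rgb_image, num_frames=32):
    """Placeholder for a large pretrained video-generation model that produces
    a moving-camera video conditioned on the single input image (assumption)."""
    raise NotImplementedError("plug in a pretrained image-to-video model")

def frames_to_point_cloud(frames):
    """Placeholder for a pretrained 3D foundation model that treats the
    generated frames as a multiview dataset and returns an (N, 3) point cloud."""
    raise NotImplementedError("plug in a multiview 3D reconstruction model")

def multiscale_noise_samples(points, noise_scales=(0.05, 0.02, 0.005), per_scale=4096):
    """Assumed form of the multi-scale noise idea: perturb reconstructed surface
    points at several noise magnitudes and use the perturbation size as a proxy
    target, so a regressor fit to (x, y) yields a smooth distance-like field."""
    xs, ys = [], []
    for sigma in noise_scales:
        idx = np.random.choice(len(points), per_scale)
        offsets = np.random.normal(0.0, sigma, size=(per_scale, 3))
        xs.append(points[idx] + offsets)          # perturbed query locations
        ys.append(np.linalg.norm(offsets, axis=1))  # proxy distance to surface
    return np.concatenate(xs), np.concatenate(ys)

def step_policy(x, goal, env_fn, max_step=0.05):
    """Toy geometry-aware step: head toward the goal, shrinking the step size
    where the implicit field env_fn(x) reports little clearance."""
    direction = goal - x
    direction = direction / (np.linalg.norm(direction) + 1e-8)
    clearance = float(env_fn(x))  # larger value = farther from captured geometry
    return x + min(max_step, 0.5 * clearance) * direction

Under this reading, sampling at several noise scales supervises the field both near and away from the reconstructed surface, which is one plausible way to keep the fitted representation smooth enough for downstream motion generation; the actual loss and architecture used by VGER may differ.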

View on arXiv: https://arxiv.org/abs/2505.19306
@article{zhi2025_2505.19306,
  title={From Single Images to Motion Policies via Video-Generation Environment Representations},
  author={Weiming Zhi and Ziyong Ma and Tianyi Zhang and Matthew Johnson-Roberson},
  journal={arXiv preprint arXiv:2505.19306},
  year={2025}
}