PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth

3 May 2025

Abstract

Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose-aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion-based and auto-regressive world models. By steering camera pose with self-supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.
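The abstract gives no formulas, so the following is only a minimal sketch of the general idea behind pose-aware warping with a photometric loss, as it is commonly set up in self-supervised depth estimation. It assumes a pinhole camera, nearest-neighbour sampling, and a plain L1 error. The function names `reproject` and `photometric_l1` and the tuple layout of `K` are hypothetical and are not taken from the paper.

```python
# Illustrative only: names, signatures, and the L1 error are assumptions,
# not PosePilot's actual implementation.

def reproject(u, v, depth, K, R, t):
    """Map target-frame pixel (u, v) with the given depth into the source frame.

    K = (fx, fy, cx, cy) are pinhole intrinsics. R (3x3 nested lists) and
    t (length 3) take target-camera coordinates to source-camera coordinates.
    Returns None if the point lands behind the source camera.
    """
    fx, fy, cx, cy = K
    # Back-project the pixel to a 3D point in the target camera frame.
    X = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Apply the relative rigid transform.
    Y = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if Y[2] <= 0:
        return None
    # Project back to pixel coordinates with the pinhole model.
    return fx * Y[0] / Y[2] + cx, fy * Y[1] / Y[2] + cy


def photometric_l1(target, source, depth, K, R, t):
    """Mean absolute error between the target image and the source image
    sampled at the reprojected locations (out-of-view pixels are skipped)."""
    h, w = len(target), len(target[0])
    total, count = 0.0, 0
    for v in range(h):
        for u in range(w):
            p = reproject(u, v, depth[v][u], K, R, t)
            if p is None:
                continue
            ui, vi = round(p[0]), round(p[1])
            if 0 <= ui < w and 0 <= vi < h:
                total += abs(target[v][u] - source[vi][ui])
                count += 1
    return total / count if count else 0.0


# Toy example: the source image is the target shifted one pixel to the right.
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
K = (1.0, 1.0, 0.0, 0.0)
target = [[1, 2, 3], [4, 5, 6]]
source = [[0, 1, 2], [0, 4, 5]]
depth = [[1.0] * 3 for _ in range(2)]

loss_true_pose = photometric_l1(target, source, depth, K, IDENTITY, [1.0, 0.0, 0.0])
loss_wrong_pose = photometric_l1(target, source, depth, K, IDENTITY, [0.0, 0.0, 0.0])
```

In this toy setup the error is lower under the pose that actually relates the two frames than under a wrong pose, which is the property that lets a photometric loss supervise depth and pose without ground-truth labels. A practical implementation would use differentiable bilinear sampling on tensors rather than nearest-neighbour lookups.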


@article{jin2025_2505.01729,
  title={PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth},
  author={Bu Jin and Weize Li and Baihan Yang and Zhenxin Zhu and Junpeng Jiang and Huan-ang Gao and Haiyang Sun and Kun Zhan and Hengtong Hu and Xueyang Zhang and Peng Jia and Hao Zhao},
  journal={arXiv preprint arXiv:2505.01729},
  year={2025}
}
