PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation

Abstract

Traditional monocular depth estimation suffers from inherent ambiguity and visual nuisances. We argue that a language prior can enhance monocular depth estimation by leveraging the inductive bias learned during the text-to-image pre-training of diffusion models. The ability of these models to generate images that align with text indicates that they have learned the spatial relationships, sizes, and shapes of described objects, knowledge that can be applied to improve depth estimation. We therefore propose PriorDiffusion, which uses a pre-trained text-to-image diffusion model that takes both an image and its corresponding text description to infer affine-invariant depth through a denoising process. We also show that the language prior enhances the model's perception of the specific image regions that users describe and care about. At the same time, the language prior acts as a constraint that accelerates convergence during training and shortens the inference diffusion trajectory. By training on HyperSim and Virtual KITTI, we achieve faster training convergence, fewer inference diffusion steps, and state-of-the-art zero-shot performance across NYUv2, KITTI, ETH3D, and ScanNet. Code will be released upon acceptance.
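The two ideas in the abstract, conditioning a denoising process on joint image-text features and evaluating depth up to an affine transform, can be sketched in a few lines. The following is a toy numpy illustration under assumed shapes, not the authors' implementation: `denoise_depth` and its linear update are hypothetical stand-ins for the diffusion UNet, while `affine_align` shows the standard least-squares scale-and-shift alignment used to score affine-invariant depth predictions.

```python
import numpy as np

def denoise_depth(image_feat, text_feat, steps=10, seed=0):
    """Toy stand-in for PriorDiffusion-style inference (assumed structure):
    a depth latent starts as noise and is iteratively denoised, conditioned
    on image features concatenated with a text embedding of the scene."""
    rng = np.random.default_rng(seed)
    cond = np.concatenate([image_feat, text_feat])   # joint conditioning
    depth = rng.standard_normal(image_feat.shape)    # start from pure noise
    for _ in range(steps):
        # hypothetical "predicted noise": pulls the latent toward the
        # image-conditioned target; a real model uses a UNet here
        eps_hat = 0.1 * (depth - cond[: depth.size].reshape(depth.shape))
        depth = depth - eps_hat                      # one denoising step
    return depth

def affine_align(pred, gt):
    """Least-squares alignment used for affine-invariant depth evaluation:
    find scale s and shift t minimizing ||s * pred + t - gt||^2."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    s, t = np.linalg.lstsq(A, gt.ravel(), rcond=None)[0]
    return s * pred + t
```

Because the prediction is affine-invariant, any global scale or shift in the raw output is removed by `affine_align` before computing error metrics, which is why such models can train on synthetic data (HyperSim, Virtual KITTI) and transfer zero-shot to real benchmarks with different depth ranges.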

@article{zeng2025_2411.16750,
  title={PriorDiffusion: Leverage Language Prior in Diffusion Models for Monocular Depth Estimation},
  author={Ziyao Zeng and Jingcheng Ni and Daniel Wang and Patrick Rim and Younjoon Chung and Fengyu Yang and Byung-Woo Hong and Alex Wong},
  journal={arXiv preprint arXiv:2411.16750},
  year={2025}
}