Self-supervised Monocular Depth Estimation Based on Hierarchical Feature-Guided Diffusion

Self-supervised monocular depth estimation has received widespread attention because it can be trained without ground-truth depth. In real-world scenarios, images may be blurry or noisy due to weather conditions and the inherent limitations of the camera, so it is particularly important to develop a robust depth estimation model. Owing to their training strategies, generative-based methods often exhibit enhanced robustness. In light of this, we employ a generative diffusion model, with its distinctive denoising training process, for self-supervised monocular depth estimation. To further enhance the robustness of the diffusion model, we investigate the influence of perturbations on image features and propose a hierarchical feature-guided denoising module. Furthermore, we exploit the implicit depth contained in the reprojection process and design an implicit depth consistency loss. This loss is not affected by the other subnetwork, so it can constrain the depth estimation network directly and enforce scale-consistent depth within a video sequence. We conduct experiments on the KITTI and Make3D datasets. The results show that our approach outperforms other generative-based models while also exhibiting strong robustness.
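To make the reprojection idea concrete, below is a minimal PyTorch-style sketch of one common way to obtain the "implicit" depth produced by reprojection (the z-component of target points transformed into the source camera) and compare it against the depth predicted for the source frame. The function name, the normalized-difference form, and the tensor layout are illustrative assumptions for this sketch, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def implicit_depth_consistency_loss(depth_t, depth_s, T_t2s, K, K_inv):
    """Sketch of a reprojection-based depth consistency term (assumed form).

    depth_t, depth_s : (B, 1, H, W) predicted depths for target / source frames
    T_t2s            : (B, 4, 4) relative camera pose from target to source
    K, K_inv         : (B, 3, 3) camera intrinsics and their inverse
    """
    B, _, H, W = depth_t.shape
    device, dtype = depth_t.device, depth_t.dtype

    # Homogeneous pixel grid of the target frame, shape (B, 3, H*W)
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=dtype),
        torch.arange(W, device=device, dtype=dtype),
        indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)
    pix = pix.unsqueeze(0).expand(B, -1, -1)

    # Back-project target pixels to 3-D points using the predicted depth
    cam_t = (K_inv @ pix) * depth_t.reshape(B, 1, -1)

    # Transform the points into the source camera and project them
    cam_t_h = torch.cat(
        [cam_t, torch.ones(B, 1, H * W, device=device, dtype=dtype)], dim=1)
    cam_s = (T_t2s @ cam_t_h)[:, :3]
    proj = K @ cam_s
    z_proj = proj[:, 2:3].clamp(min=1e-6)   # implicit depth from reprojection
    uv = proj[:, :2] / z_proj               # pixel coordinates in the source view

    # Bilinearly sample the source depth map at the reprojected coordinates
    u_norm = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v_norm = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u_norm, v_norm], dim=-1).reshape(B, H, W, 2)
    depth_s_warp = F.grid_sample(depth_s, grid, padding_mode="border",
                                 align_corners=True)

    # Normalized absolute difference between projected and sampled depth
    z_proj = z_proj.reshape(B, 1, H, W)
    diff = (depth_s_warp - z_proj).abs() / (depth_s_warp + z_proj).clamp(min=1e-6)
    return diff.mean()
```

Because the projected depth and the sampled depth must agree up to a common scale, penalizing their normalized difference encourages scale-consistent predictions across frames of a video sequence.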
@article{liu2025_2406.09782,
  title={Self-supervised Monocular Depth Estimation Based on Hierarchical Feature-Guided Diffusion},
  author={Runze Liu and Dongchen Zhu and Guanghui Zhang and Lei Wang and Jiamao Li},
  journal={arXiv preprint arXiv:2406.09782},
  year={2025}
}