MVInverse: Feed-forward Multi-view Inverse Rendering in Seconds

24 December 2025

Xiangzuo Wu

Chengwei Ren

Jun Zhou

Xiu Li

Yuan Liu

3DV

Main:9 Pages

17 Figures

Bibliography:6 Pages

5 Tables

Appendix:6 Pages

Abstract

Multi-view inverse rendering aims to recover geometry, materials, and illumination consistently across multiple viewpoints. When applied to multi-view images, existing single-view approaches often ignore cross-view relationships, leading to inconsistent results. In contrast, multi-view optimization methods rely on slow differentiable rendering and per-scene refinement, making them computationally expensive and hard to scale. To address these limitations, we introduce a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallic, roughness, diffuse shading, and surface normals from sequences of RGB images. By alternating attention across views, our model captures both intra-view long-range lighting interactions and inter-view material consistency, enabling coherent scene-level reasoning within a single forward pass. Due to the scarcity of real-world training data, models trained on existing synthetic datasets often struggle to generalize to real-world scenes. To overcome this limitation, we propose a consistency-based finetuning strategy that leverages unlabeled real-world videos to enhance both multi-view coherence and robustness under in-the-wild conditions. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance in terms of multi-view consistency, material and normal estimation quality, and generalization to real-world imagery. Project page: this https URL
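The alternating attention mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "alternating" means interleaving attention restricted to the tokens of one view (intra-view) with attention over the tokens of all views jointly (inter-view). The function names, the identity query/key/value projections, the absence of residual connections and normalization, and the requirement that every view has the same number of tokens are all simplifications made for illustration.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    # Single-head scaled dot-product self-attention.
    # Assumption: identity projections stand in for learned Q/K/V weights.
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    return out


def intra_view_attention(views):
    # Each token attends only to tokens of its own view.
    return [self_attention(view) for view in views]


def inter_view_attention(views):
    # Tokens of all views are flattened into one sequence, attend to each
    # other, and are split back into per-view groups.
    # Assumption: every view holds the same number of tokens.
    tokens_per_view = len(views[0])
    flat = [tok for view in views for tok in view]
    mixed = self_attention(flat)
    return [
        mixed[i:i + tokens_per_view]
        for i in range(0, len(mixed), tokens_per_view)
    ]


def alternating_attention(views, num_blocks=2):
    # Hypothetical block layout: intra-view followed by inter-view, repeated.
    for _ in range(num_blocks):
        views = intra_view_attention(views)
        views = inter_view_attention(views)
    return views
```

Here `views` is a list of views, each a list of token vectors, for example `alternating_attention([[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [2.0, -1.0]]])`. The output keeps the same views-by-tokens-by-channels layout, so per-view prediction heads (albedo, roughness, normals, and so on) could in principle be attached to each view's tokens.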
