Bolt3D: Generating 3D Scenes in Seconds
Stanislaw Szymanowicz
Jason Y. Zhang
Pratul Srinivasan
Ruiqi Gao
Arthur Brussee
Aleksander Holynski
Ricardo Martin-Brualla
Jonathan T. Barron
Philipp Henzler

Abstract
We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging existing powerful and scalable 2D diffusion network architectures to produce consistent, high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces inference cost by up to a factor of 300.
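The abstract specifies only the inference interface: one or more images in, a sampled 3D scene representation out, produced by a latent diffusion model in a fixed number of feed-forward denoising steps. The sketch below illustrates what such a conditional sampling loop could look like. Everything in it, including the TinyDenoiser module, the latent shapes, and the plain DDPM schedule, is an illustrative assumption and not Bolt3D's actual architecture or API.

# Hypothetical sketch of the feed-forward inference described in the
# abstract: condition a latent diffusion model on an image latent and
# sample a scene latent in a fixed number of denoising steps.
# All names, shapes, and the DDPM schedule are illustrative assumptions.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for the 2D-diffusion-style denoiser the abstract mentions."""
    def __init__(self, channels=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels + 1, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, z_t, cond, t):
        # Broadcast the timestep as an extra channel, concatenate the
        # image-conditioning latent, and predict the noise.
        t_map = t.view(-1, 1, 1, 1).expand(z_t.shape[0], 1, *z_t.shape[2:])
        return self.net(torch.cat([z_t, cond, t_map], dim=1))

@torch.no_grad()
def sample_scene_latent(denoiser, cond, steps=50):
    """Plain DDPM ancestral sampling over a scene latent (illustrative)."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    z = torch.randn_like(cond)  # start from Gaussian noise
    for i in reversed(range(steps)):
        t = torch.full((z.shape[0],), i / steps)
        eps = denoiser(z, cond, t)
        # Posterior mean of the reverse diffusion step.
        z = (z - betas[i] / torch.sqrt(1.0 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            z = z + torch.sqrt(betas[i]) * torch.randn_like(z)
    return z  # the real system would decode this to geometry and appearance

# Usage: one conditioning latent encoded from the input image(s).
cond = torch.randn(1, 8, 32, 32)
latent = sample_scene_latent(TinyDenoiser(), cond)
print(latent.shape)  # torch.Size([1, 8, 32, 32])

Because the number of denoising steps is fixed, the cost of generation is constant per scene, which is the property that lets a feed-forward sampler like this avoid the per-scene optimization loop of prior multiview generative pipelines.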
@article{szymanowicz2025_2503.14445,
  title={Bolt3D: Generating 3D Scenes in Seconds},
  author={Stanislaw Szymanowicz and Jason Y. Zhang and Pratul Srinivasan and Ruiqi Gao and Arthur Brussee and Aleksander Holynski and Ricardo Martin-Brualla and Jonathan T. Barron and Philipp Henzler},
  journal={arXiv preprint arXiv:2503.14445},
  year={2025}
}