
VGG-T³: Offline Feed-Forward 3D Reconstruction at Scale

Sven Elflein
Ruilong Li
Sérgio Agostinho
Zan Gojcic
Laura Leal-Taixé
Qunjie Zhou
Aljosa Osep
Main: 11 pages · 10 figures · 9 tables · Bibliography: 5 pages · Appendix: 4 pages
Abstract

We present a scalable 3D reconstruction model that addresses a critical limitation of offline feed-forward methods: their computational and memory requirements grow quadratically with the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T³ (Visual Geometry Grounded Test-Time Training) scales linearly with the number of input views, similar to online models, and reconstructs a 1k-image collection in just 54 seconds, an 11.6× speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, its point-map reconstruction accuracy surpasses other linear-time methods by large margins. Finally, we demonstrate the visual localization capabilities of our model by querying the scene representation with unseen images.
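To give intuition for the core idea, the following is a minimal sketch of distilling a growing KV cache into a fixed-size fast-weight memory via test-time training. This is the generic mechanism only: the weight matrix, reconstruction loss, and learning rate below are illustrative assumptions (the paper's actual MLP architecture, objective, and training details are not specified here). Each incoming view contributes a key/value pair that is absorbed by one gradient step, so memory stays constant regardless of how many views are streamed in.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16    # feature dimension (illustrative)
lr = 0.1  # inner-loop learning rate for test-time training (illustrative)

# Fixed-size "fast weight" standing in for the growing KV cache.
# A single linear map for brevity; the paper distills into an MLP.
W = np.zeros((d, d))

def ttt_update(W, k, v, lr):
    """One test-time gradient step on the reconstruction loss
    ||W k - v||^2, so W absorbs the (k, v) pair."""
    err = W @ k - v
    return W - lr * np.outer(err, k)

def readout(W, q):
    """Query the compressed scene representation with a query feature."""
    return W @ q

# Stream of per-view key/value features; storage is O(d^2),
# independent of the number of views, unlike a softmax-attention
# KV cache whose size grows with every view.
for _ in range(1000):
    k = rng.standard_normal(d)
    v = rng.standard_normal(d)
    W = ttt_update(W, k, v, lr)

# Querying with an unseen feature (analogous to visual localization
# against the distilled scene representation).
q = rng.standard_normal(d)
out = readout(W, q)
print(out.shape)
```

Because each update touches only the fixed-size weights, total cost is linear in the number of views, which is the scaling behavior the abstract claims for the full model.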
