LIFT-GS: Cross-Scene Render-Supervised Distillation for 3D Language Grounding

Our approach to 3D vision-language understanding is to train a feedforward model that makes predictions in 3D but never requires 3D labels: it is supervised only in 2D, using 2D losses and differentiable rendering. This approach is new for vision-language understanding. By treating the reconstruction as a "latent variable", we can render the outputs without placing unnecessary constraints on the network architecture (e.g., it can be used with decoder-only models). For training, we only need images, camera poses, and 2D labels. We show that we can even remove the need for 2D labels by using pseudo-labels from pretrained 2D models. We use this approach to pretrain a network, which we then finetune for 3D vision-language understanding tasks. We show that this approach outperforms baseline and state-of-the-art methods for 3D vision-language grounding, and also outperforms other 3D pretraining techniques. Project page: this https URL.
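The recipe described above can be summarized as: predict in 3D, render differentiably to 2D, and supervise with purely 2D losses so gradients flow back through the renderer into the 3D predictor. Below is a minimal, self-contained sketch of that training loop, not the authors' implementation: ToyModel and toy_render are hypothetical stand-ins for the paper's feedforward 3D predictor and Gaussian-splatting renderer, and the shapes and loss are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyModel(nn.Module):
    """Stand-in feedforward predictor: posed images -> per-point 3D features."""
    def __init__(self, n_points=256, feat_dim=16):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)
        self.n_points = n_points

    def forward(self, images):
        feats = self.backbone(images)              # (B, C, H, W)
        feats = feats.flatten(2).permute(0, 2, 1)  # (B, H*W, C)
        return feats[:, : self.n_points]           # toy "3D" scene features

def toy_render(point_feats, out_hw=(16, 16)):
    """Stand-in differentiable renderer: maps 3D features to a 2D feature map."""
    b, n, c = point_feats.shape
    h, w = out_hw
    # Toy projection: pool features and broadcast to an image-shaped tensor.
    return point_feats.mean(dim=1).view(b, c, 1, 1).expand(b, c, h, w)

model = ToyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(2, 3, 16, 16)      # posed RGB views (poses omitted in this toy)
labels_2d = torch.randn(2, 16, 16, 16)  # e.g. pseudo-labels from a pretrained 2D model

# One render-supervised step: 3D prediction -> 2D render -> 2D-only loss.
pred_3d = model(images)                 # 3D representation, treated as a latent variable
rendered = toy_render(pred_3d)          # differentiable rendering into each view
loss = F.mse_loss(rendered, labels_2d)  # supervision lives entirely in 2D
loss.backward()                         # gradients flow through the renderer into the model
optimizer.step()
```

Note that no 3D ground truth appears anywhere in this loop; all supervision comes from 2D targets, which is what allows the 2D labels to be replaced by pseudo-labels from pretrained 2D models.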
@article{cao2025_2502.20389,
  title={LIFT-GS: Cross-Scene Render-Supervised Distillation for 3D Language Grounding},
  author={Ang Cao and Sergio Arnaud and Oleksandr Maksymets and Jianing Yang and Ayush Jain and Sriram Yenamandra and Ada Martin and Vincent-Pierre Berges and Paul McVay and Ruslan Partsey and Aravind Rajeswaran and Franziska Meier and Justin Johnson and Jeong Joon Park and Alexander Sax},
  journal={arXiv preprint arXiv:2502.20389},
  year={2025}
}