LIFT-GS: Cross-Scene Render-Supervised Distillation for 3D Language Grounding

Abstract

Our approach to training 3D vision-language understanding models is to train a feedforward model that makes predictions in 3D but never requires 3D labels: it is supervised only in 2D, using 2D losses and differentiable rendering. This render-supervised approach is new for vision-language understanding. By treating the reconstruction as a "latent variable", we can render the outputs without placing unnecessary constraints on the network architecture (e.g., it can be used with decoder-only models). Training requires only images, camera poses, and 2D labels, and we show that even the need for 2D labels can be removed by using pseudo-labels from pretrained 2D models. We use this approach to pretrain a network and then finetune it for 3D vision-language understanding tasks. We show that this approach outperforms baseline and state-of-the-art methods for 3D vision-language grounding, and also outperforms other 3D pretraining techniques. Project page: this https URL.
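To make the training signal concrete, the sketch below illustrates the general render-supervised idea described in the abstract: a feedforward model predicts a 3D representation, that representation is differentiably rendered into 2D, and the loss is computed only against 2D (pseudo-)labels. This is not the authors' implementation; the names `SceneEncoder` and `splat_to_image` are illustrative placeholders, and the crude soft-splat "renderer" stands in for a real Gaussian-splatting rasterizer.

```python
# Minimal sketch of render-supervised 3D training (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SceneEncoder(nn.Module):
    """Toy feedforward model: lifts per-point inputs to refined 3D positions + semantic logits."""

    def __init__(self, in_dim=3, num_classes=8, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + num_classes),  # xyz refinement + class logits
        )

    def forward(self, points):                      # points: (N, 3)
        out = self.mlp(points)
        xyz = points + 0.01 * torch.tanh(out[:, :3])
        logits = out[:, 3:]
        return xyz, logits


def splat_to_image(xyz, logits, H=32, W=32):
    """Very crude differentiable 'renderer': orthographic soft-splat of per-point
    class probabilities onto an HxW grid (stand-in for Gaussian-splat rasterization)."""
    probs = logits.softmax(dim=-1)                   # (N, C)
    u = (xyz[:, 0].clamp(-1, 1) + 1) * 0.5 * (W - 1)  # x -> pixel column
    v = (xyz[:, 1].clamp(-1, 1) + 1) * 0.5 * (H - 1)  # y -> pixel row
    ys, xs = torch.meshgrid(torch.arange(H, dtype=xyz.dtype),
                            torch.arange(W, dtype=xyz.dtype), indexing="ij")
    # soft pixel-to-point weights from distance (differentiable w.r.t. xyz)
    d2 = (xs.reshape(-1, 1) - u) ** 2 + (ys.reshape(-1, 1) - v) ** 2  # (HW, N)
    w = torch.softmax(-d2, dim=1)
    img = w @ probs                                  # (HW, C)
    return img.reshape(H, W, -1)


model = SceneEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

points = torch.randn(256, 3)                  # e.g. points unprojected from posed RGB-D
pseudo_label = torch.randint(0, 8, (32, 32))  # e.g. pseudo-labels from a pretrained 2D model

for step in range(100):
    xyz, logits = model(points)
    rendered = splat_to_image(xyz, logits)           # (H, W, C), fully differentiable
    log_probs = rendered.reshape(-1, 8).clamp_min(1e-8).log()
    loss = F.nll_loss(log_probs, pseudo_label.reshape(-1))  # 2D-only supervision
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key design choice the example mirrors is that no 3D label ever enters the loss; the predicted 3D representation is only constrained through what it renders to in 2D.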

@article{cao2025_2502.20389,
  title={ LIFT-GS: Cross-Scene Render-Supervised Distillation for 3D Language Grounding },
  author={ Ang Cao and Sergio Arnaud and Oleksandr Maksymets and Jianing Yang and Ayush Jain and Sriram Yenamandra and Ada Martin and Vincent-Pierre Berges and Paul McVay and Ruslan Partsey and Aravind Rajeswaran and Franziska Meier and Justin Johnson and Jeong Joon Park and Alexander Sax },
  journal={arXiv preprint arXiv:2502.20389},
  year={ 2025 }
}