
NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving

Abstract

Multi-view 3D visual grounding is critical for autonomous vehicles to interpret natural language and localize target objects in complex environments. However, existing datasets and methods suffer from coarse-grained language instructions and inadequate integration of 3D geometric reasoning with linguistic comprehension. To this end, we introduce NuGrounding, the first large-scale benchmark for multi-view 3D visual grounding in autonomous driving. To construct NuGrounding, we present a Hierarchy of Grounding (HoG) method that generates hierarchical, multi-level instructions, ensuring comprehensive coverage of human instruction patterns. To tackle this challenging dataset, we propose a novel paradigm that seamlessly combines the instruction-comprehension abilities of multi-modal LLMs (MLLMs) with the precise localization abilities of specialist detection models. Our approach introduces two decoupled task tokens and a context query to aggregate 3D geometric information and semantic instructions, followed by a fusion decoder that refines spatial-semantic feature fusion for precise localization. Extensive experiments demonstrate that our method significantly outperforms baselines adapted from representative 3D scene understanding methods, achieving 0.59 precision and 0.64 recall, improvements of 50.8% and 54.7%, respectively.
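To make the decoupled-token design concrete, the following is a minimal PyTorch-style sketch of how two task tokens and a context query might attend over concatenated geometric (e.g., BEV) features and MLLM instruction embeddings before a fusion decoder regresses 3D boxes. All module and tensor names (`GroundingHead`, `FusionDecoder`, `bev_feats`, `instr_embed`, dimensions) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the decoupled-token grounding head described in the
# abstract. Every name and dimension here is an assumption for illustration.
import torch
import torch.nn as nn


class FusionDecoder(nn.Module):
    """Cross-attends query tokens over the fused spatial-semantic memory."""

    def __init__(self, d_model=256, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.box_head = nn.Linear(d_model, 7)  # (x, y, z, w, l, h, yaw)

    def forward(self, queries, memory):
        return self.box_head(self.decoder(queries, memory))


class GroundingHead(nn.Module):
    """Two decoupled task tokens plus a context query aggregate 3D geometric
    features (from a detector) and semantic instructions (from an MLLM)."""

    def __init__(self, d_model=256, n_queries=100):
        super().__init__()
        self.loc_token = nn.Parameter(torch.randn(1, 1, d_model))       # localization task token
        self.sem_token = nn.Parameter(torch.randn(1, 1, d_model))       # semantic task token
        self.context_query = nn.Parameter(torch.randn(1, n_queries, d_model))
        self.fusion = FusionDecoder(d_model)

    def forward(self, bev_feats, instr_embed):
        # bev_feats: (B, N_bev, d) geometric features; instr_embed: (B, N_txt, d)
        B = bev_feats.size(0)
        queries = torch.cat([
            self.loc_token.expand(B, -1, -1),
            self.sem_token.expand(B, -1, -1),
            self.context_query.expand(B, -1, -1),
        ], dim=1)
        memory = torch.cat([bev_feats, instr_embed], dim=1)  # spatial-semantic memory
        out = self.fusion(queries, memory)
        return out[:, 2:]  # drop task-token slots; per-query 3D boxes remain


# Usage with dummy features:
boxes = GroundingHead()(torch.randn(2, 900, 256), torch.randn(2, 32, 256))
print(boxes.shape)  # torch.Size([2, 100, 7])
```

The design intuition, as stated in the abstract, is to keep the MLLM's semantic reasoning and the detector's geometric localization in separate tokens so each stream specializes, with the fusion decoder reconciling them only at the final localization stage.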

@article{li2025_2503.22436,
  title={NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving},
  author={Fuhao Li and Huan Jin and Bin Gao and Liaoyuan Fan and Lihui Jiang and Long Zeng},
  journal={arXiv preprint arXiv:2503.22436},
  year={2025}
}