Better Call SAL: Towards Learning to Segment Anything in Lidar

Abstract

We propose SAL (Segment Anything in Lidar), a method consisting of a text-promptable zero-shot model for segmenting and classifying any object in Lidar, and a pseudo-labeling engine that facilitates model training without manual supervision. While the established paradigm for Lidar Panoptic Segmentation (LPS) relies on manual supervision for a handful of object classes defined a priori, we utilize 2D vision foundation models to generate 3D supervision "for free". Our pseudo-labels consist of instance masks and corresponding CLIP tokens, which we lift to Lidar using calibrated multi-modal data. By training our model on these labels, we distill the 2D foundation models into our Lidar SAL model. Even without manual labels, our model achieves 91% of the class-agnostic segmentation performance and 44% of the zero-shot LPS performance of the fully supervised state-of-the-art. Furthermore, we outperform several baselines that do not distill but only lift image features to 3D. More importantly, we demonstrate that SAL supports arbitrary class prompts, can be easily extended to new datasets, and shows significant potential to improve with increasing amounts of self-labeled data.
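To make the pseudo-labeling step concrete, the sketch below illustrates how 2D instance masks and per-instance CLIP tokens (produced by 2D foundation models on a calibrated camera image) could be lifted onto a Lidar point cloud via projection. This is not the authors' implementation; the function name, the calibration matrices, and the `masks`/`clip_tokens` inputs are hypothetical placeholders shown only to clarify the idea.

```python
# Minimal sketch, assuming pinhole calibration and precomputed 2D pseudo-labels.
import numpy as np

def lift_masks_to_lidar(points, masks, clip_tokens, K, T_cam_from_lidar, img_hw):
    """Assign each Lidar point an instance id and a CLIP token via 2D projection.

    points:           (N, 3) Lidar points in the Lidar frame.
    masks:            (M, H, W) boolean 2D instance masks from a 2D segmenter.
    clip_tokens:      (M, D) one CLIP feature vector per 2D instance.
    K:                (3, 3) camera intrinsics.
    T_cam_from_lidar: (4, 4) extrinsic transform, Lidar frame -> camera frame.
    img_hw:           (H, W) image size in pixels.
    """
    H, W = img_hw

    # Transform points into the camera frame (homogeneous coordinates).
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    in_front = pts_cam[:, 2] > 0.1

    # Perspective projection to pixel coordinates.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    in_image = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    # Pseudo-labels: instance id (-1 = unlabeled) and a CLIP token per point.
    instance_ids = np.full(points.shape[0], -1, dtype=int)
    point_tokens = np.zeros((points.shape[0], clip_tokens.shape[1]))
    for m in range(masks.shape[0]):
        hit = in_image & masks[m, v.clip(0, H - 1), u.clip(0, W - 1)]
        instance_ids[hit] = m
        point_tokens[hit] = clip_tokens[m]
    return instance_ids, point_tokens
```

Zero-shot classification with arbitrary class prompts then reduces to comparing each instance's (distilled) CLIP token against CLIP text embeddings of the prompt set, e.g. via cosine similarity, so no fixed class vocabulary has to be baked into the model.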
