P3P: Pseudo-3D Pre-training for Scaling 3D Masked Autoencoders

Pre-training in 3D is pivotal for advancing 3D perception tasks. However, the scarcity of clean 3D data poses a significant challenge to scaling 3D pre-training. Drawing inspiration from semi-supervised learning, which effectively combines limited labeled data with abundant unlabeled data, we introduce a self-supervised pre-training framework that leverages both authentic 3D data and pseudo-3D data generated from images by a robust depth estimation model. Another critical challenge is the efficiency of the pre-training process. Existing approaches such as Point-BERT and Point-MAE rely on k-nearest-neighbor search for 3D token embedding, which incurs time complexity quadratic in the number of input points. To address this, we propose a token embedding strategy with linear time complexity, coupled with a training-efficient 2D reconstruction target. Our method achieves state-of-the-art performance in 3D classification, detection, and few-shot learning while remaining highly efficient in both pre-training and downstream fine-tuning.
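
To make the pseudo-3D data idea concrete, the following minimal sketch lifts a single RGB image into a point cloud by back-projecting a predicted depth map through a pinhole camera model. The abstract does not specify the depth estimator or camera model, so the intrinsics and the `estimate_depth` helper below are illustrative assumptions, not the paper's exact pipeline.

    # Minimal sketch: lift an RGB image to a pseudo-3D point cloud from a
    # monocular depth estimate. The depth model and camera intrinsics are
    # illustrative assumptions, not the paper's prescribed pipeline.
    import numpy as np

    def backproject(depth: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float) -> np.ndarray:
        """Back-project an (H, W) depth map to an (H*W, 3) point cloud
        via the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
        z = depth
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        return np.stack([x, y, z], axis=-1).reshape(-1, 3)

    # Usage with an assumed off-the-shelf monocular depth model;
    # `estimate_depth` is a hypothetical helper standing in for it.
    # depth = estimate_depth(image)                      # (H, W) depth map
    # points = backproject(depth, fx=500.0, fy=500.0, cx=w / 2, cy=h / 2)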
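The linear-time token embedding contrasts with kNN grouping, whose naive pairwise-distance computation costs O(N^2) in the number of points. One standard way to reach linear time, shown below as a hedged sketch of the general flavor rather than the paper's exact operator, is to hash each point into a sparse voxel grid in a single pass and pool one token per occupied voxel.

    # Minimal sketch of linear-time token grouping via voxel hashing,
    # in contrast to O(N^2) kNN. An assumption about the style of method,
    # not the paper's exact embedding operator.
    import numpy as np

    def voxelize_tokens(points: np.ndarray, voxel_size: float) -> np.ndarray:
        """Group an (N, 3) point cloud into one token per occupied voxel
        in a single O(N) pass over the points."""
        buckets: dict = {}
        keys = np.floor(points / voxel_size).astype(np.int64)
        for key, point in zip(map(tuple, keys), points):
            buckets.setdefault(key, []).append(point)  # hash point to its voxel
        # Pool each voxel's points into a single token (here, the centroid).
        return np.stack([np.mean(group, axis=0) for group in buckets.values()])

    # tokens = voxelize_tokens(np.random.rand(10000, 3), voxel_size=0.05)

Because every point is touched exactly once and hash lookups are constant time on average, the cost grows linearly with N, whereas kNN grouping must compare points against each other.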
@article{chen2025_2408.10007,
  title   = {P3P: Pseudo-3D Pre-training for Scaling 3D Masked Autoencoders},
  author  = {Xuechao Chen and Ying Chen and Jialin Li and Qiang Nie and Hanqiu Deng and Yong Liu and Qixing Huang and Yang Li},
  journal = {arXiv preprint arXiv:2408.10007},
  year    = {2025}
}