Recent point cloud object detectors increasingly adopt Transformer-based models and State Space Models (SSMs) to capture long-range dependencies. However, these serialized frameworks keep the input and output voxel dimensions strictly identical and therefore cannot expand the voxel set. This limitation hinders performance, because expanding the voxel set is known to significantly improve detection accuracy, particularly for sparse foreground objects. To bridge this gap, we propose a Voxel Densification Module (VDM). Unlike standard convolutional stems, VDM is explicitly designed to expand the voxel set spatially before serialization. It uses sparse 3D convolutions to propagate foreground semantics to neighboring empty voxels, densifying the feature representation before it is flattened into a sequence. VDM also incorporates residual sparse blocks that aggregate fine-grained local context for richer geometric feature extraction. To offset the computational overhead of the increased voxel density, we introduce a cascaded downsampling mechanism. We integrate VDM into both a Transformer-based detector (DSVT) and an SSM-based detector (LION). Extensive experiments show that VDM improves detection accuracy over the baseline models across multiple benchmarks. Specifically, our method achieves 74.8 mAPH (L2) on the Waymo validation set, 70.5 mAP on the nuScenes test set, 42.6 mAP on the Argoverse 2 validation set, and 67.6 mAP on the ONCE validation set. The source code will be made publicly available at this https URL.
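
The densify-then-downsample idea described above can be illustrated with a minimal pure-Python sketch. This is not the authors' VDM: the function names `dilate_active_voxels` and `strided_downsample`, the fixed averaging in place of learned convolution weights, and the kernel and stride values are all assumptions chosen for illustration. The sketch only shows how a regular (non-submanifold) sparse convolution pattern grows the set of active voxels, and how a strided step then reduces the voxel count again.

```python
from itertools import product


def _average(pairs):
    """Average feature vectors that share the same voxel key."""
    sums, counts = {}, {}
    for key, feat in pairs:
        acc = sums.setdefault(key, [0.0] * len(feat))
        for i, value in enumerate(feat):
            acc[i] += value
        counts[key] = counts.get(key, 0) + 1
    return {k: [v / counts[k] for v in acc] for k, acc in sums.items()}


def dilate_active_voxels(features, kernel_size=3):
    """Spread each active voxel's feature to every voxel in its kernel window.

    `features` maps integer (x, y, z) coordinates to a feature list.
    Previously empty neighbors become active, so the voxel set grows.
    Averaging stands in for learned weights (an assumption of this sketch).
    """
    r = kernel_size // 2
    offsets = list(product(range(-r, r + 1), repeat=3))
    return _average(
        ((x + dx, y + dy, z + dz), feat)
        for (x, y, z), feat in features.items()
        for dx, dy, dz in offsets
    )


def strided_downsample(features, stride=2):
    """Merge voxels into coarser cells (floor division) and average them."""
    return _average(
        ((x // stride, y // stride, z // stride), feat)
        for (x, y, z), feat in features.items()
    )


# Two isolated foreground voxels with one-channel features.
sparse = {(0, 0, 0): [1.0], (2, 0, 0): [3.0]}
dense = dilate_active_voxels(sparse)   # voxel set expands
coarse = strided_downsample(dense)     # voxel count is reduced again
print(len(sparse), len(dense), len(coarse))  # 2 45 12
print(dense[(1, 0, 0)])  # [2.0], the mean of both neighbors
```

In this toy case two active voxels expand to 45 after dilation, and one strided step reduces them to 12 coarser cells. A submanifold convolution, by contrast, would leave the active set at the original two voxels.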
