Bird's-Eye-View (BEV) semantic segmentation provides comprehensive environmental perception for autonomous driving but suffers from multi-modal misalignment and sensor noise. We propose RESAR-BEV, a progressive refinement framework that advances beyond single-step end-to-end approaches: (1) progressive refinement through residual autoregressive learning, which decomposes BEV segmentation into interpretable coarse-to-fine stages via our cascaded Drive-Transformer and Modifier-Transformer residual prediction architecture; (2) a robust BEV representation that combines ground-proximity voxels with adaptive height offsets and dual-path voxel feature encoding (max + attention pooling) for efficient feature extraction; and (3) decoupled supervision with offline ground-truth decomposition and online joint optimization, which prevents overfitting while ensuring structural coherence. Experiments on nuScenes demonstrate that RESAR-BEV achieves state-of-the-art performance with 54.0% mIoU across 7 essential driving-scene categories while maintaining real-time inference at 14.6 FPS. The framework remains robust in challenging scenarios, including long-range perception and adverse weather.
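To make the progressive-refinement idea concrete, the following is a minimal PyTorch sketch of residual autoregressive coarse-to-fine prediction, written only from what the abstract states. It is not the authors' implementation: the Drive-Transformer and Modifier-Transformer are replaced by simple convolutional heads (`base_head`, `RefineStage`), and the feature channels, 7-class output, stage count, and BEV grid size are illustrative assumptions.

```python
import torch
import torch.nn as nn


class RefineStage(nn.Module):
    """One refinement stage: predicts a residual correction to the running BEV
    logits, conditioned on the shared BEV features and the current estimate.
    Stands in for the paper's Modifier-Transformer (assumed simplification)."""

    def __init__(self, feat_ch: int, num_classes: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch + num_classes, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_classes, 3, padding=1),
        )

    def forward(self, bev_feat: torch.Tensor, prev_logits: torch.Tensor) -> torch.Tensor:
        # Autoregressive step: the residual depends on the previous prediction.
        x = torch.cat([bev_feat, prev_logits], dim=1)
        return self.net(x)


class ResidualAutoregressiveBEV(nn.Module):
    """Coarse-to-fine BEV segmentation: a base head (stand-in for the
    Drive-Transformer) produces a coarse prediction, then each cascaded stage
    adds a predicted residual to refine it."""

    def __init__(self, feat_ch: int = 128, num_classes: int = 7, num_stages: int = 3):
        super().__init__()
        self.base_head = nn.Conv2d(feat_ch, num_classes, 1)  # coarse prediction
        self.stages = nn.ModuleList(
            [RefineStage(feat_ch, num_classes) for _ in range(num_stages)]
        )

    def forward(self, bev_feat: torch.Tensor) -> list[torch.Tensor]:
        logits = self.base_head(bev_feat)
        outputs = [logits]  # keep per-stage outputs for decoupled (per-stage) supervision
        for stage in self.stages:
            logits = logits + stage(bev_feat, logits)  # add the predicted residual
            outputs.append(logits)
        return outputs


if __name__ == "__main__":
    feats = torch.randn(2, 128, 200, 200)   # fused camera-radar BEV features (assumed shape)
    preds = ResidualAutoregressiveBEV()(feats)
    print([p.shape for p in preds])          # 1 coarse map + 3 progressively refined maps
```

Returning every intermediate prediction mirrors the decoupled supervision described above: each stage can be trained against its own decomposed target while the final output is optimized jointly.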
@article{zeng2025_2505.06515,
  title={RESAR-BEV: An Explainable Progressive Residual Autoregressive Approach for Camera-Radar Fusion in BEV Segmentation},
  author={Zhiwen Zeng and Yunfei Yin and Zheng Yuan and Argho Dey and Xianjian Bao},
  journal={arXiv preprint arXiv:2505.06515},
  year={2025}
}