Temporal Triplane Transformers as Occupancy World Models
Recent years have seen significant advances in world models, which primarily focus on learning fine-grained correlations between an agent's motion trajectory and the resulting changes in its surrounding environment. However, existing methods often struggle to capture such fine-grained correlations and to achieve real-time prediction. To address this, we propose TFormer, a new 4D occupancy world model for autonomous driving. TFormer first pre-trains a compact triplane representation that efficiently compresses the semantically occupied 3D environment. It then extracts multi-scale temporal motion features from the historical triplanes and autoregressively predicts the change of the next triplane. Finally, TFormer combines the predicted changes with the previous triplanes and decodes them into future occupancy results and ego-motion trajectories. Experimental results demonstrate the superiority of TFormer: it achieves 1.44× faster inference (26 FPS), improves the mean IoU to 36.09, and reduces the mean absolute planning error to 1.0 m.
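To make the autoregressive rollout concrete, below is a minimal PyTorch sketch of the idea as the abstract describes it: a temporal transformer attends over historical triplane tokens, predicts the change of the next triplane, adds it to the previous triplane, and decodes occupancy logits and an ego waypoint. All module names (TriplaneWorldModelSketch, delta_head, occ_head, ego_head), tensor shapes, and the flattened-token simplification are assumptions for illustration, not the authors' implementation, which additionally uses multi-scale temporal features and a pre-trained triplane encoder/decoder.

```python
import torch
import torch.nn as nn


class TriplaneWorldModelSketch(nn.Module):
    """Hypothetical sketch of an autoregressive triplane world model."""

    def __init__(self, dim=128, tokens_per_frame=256, num_classes=17):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=4)
        self.delta_head = nn.Linear(dim, dim)          # predicts the next-step triplane change
        self.occ_head = nn.Linear(dim, num_classes)    # decodes per-token occupancy logits
        self.ego_head = nn.Linear(dim, 2)              # decodes an (x, y) ego waypoint
        self.tokens_per_frame = tokens_per_frame

    def forward(self, history, steps=3):
        # history: (B, T, N, C) flattened triplane tokens from T past frames.
        B, T, N, C = history.shape
        frames = list(history.unbind(dim=1))
        occ_preds, ego_preds = [], []
        for _ in range(steps):
            # Attend over the last T frames of tokens (the paper's multi-scale design differs).
            context = torch.cat(frames[-T:], dim=1)            # (B, T*N, C)
            feats = self.temporal(context)[:, -N:, :]          # features of the latest frame
            delta = self.delta_head(feats)                     # predicted triplane change
            next_frame = frames[-1] + delta                    # residual update of the triplane
            frames.append(next_frame)
            occ_preds.append(self.occ_head(next_frame))                   # (B, N, num_classes)
            ego_preds.append(self.ego_head(next_frame.mean(dim=1)))       # (B, 2)
        return torch.stack(occ_preds, dim=1), torch.stack(ego_preds, dim=1)


# Usage: 4 past frames of 256 tokens with 128 channels, rolled out for 3 future steps.
model = TriplaneWorldModelSketch()
hist = torch.randn(2, 4, 256, 128)
occ, ego = model(hist, steps=3)
print(occ.shape, ego.shape)  # torch.Size([2, 3, 256, 17]) torch.Size([2, 3, 2])
```

Predicting residual changes rather than full triplanes, as the abstract indicates, keeps each autoregressive step a small update on top of the previous state and helps decode both occupancy and ego motion from the same compact representation.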
@article{xu2025_2503.07338,
  title   = {Temporal Triplane Transformers as Occupancy World Models},
  author  = {Haoran Xu and Peixi Peng and Guang Tan and Yiqian Chang and Yisen Zhao and Yonghong Tian},
  journal = {arXiv preprint arXiv:2503.07338},
  year    = {2025}
}