Aerial Remote Sensing (ARS) vision tasks pose significant challenges due to the unique characteristics of their viewing angles. Existing research has primarily focused on algorithms for specific tasks, which have limited applicability in a broad range of ARS vision applications. This paper proposes the RingMo-Aerial model, aiming to fill the gap in foundation model research in the field of ARS vision. By introducing the Frequency-Enhanced Multi-Head Self-Attention (FE-MSA) mechanism and an affine transformation-based contrastive learning pre-training method, the model's detection capability for small targets is enhanced and optimized for the tilted viewing angles characteristic of ARS. Furthermore, the ARS-Adapter, an efficient parameter fine-tuning method, is proposed to improve the model's adaptability and effectiveness in various ARS vision tasks. Experimental results demonstrate that RingMo-Aerial achieves SOTA performance on multiple downstream tasks. This indicates the practicality and efficacy of RingMo-Aerial in enhancing the performance of ARS vision tasks.
View on arXiv@article{diao2025_2409.13366, title={ RingMo-Aerial: An Aerial Remote Sensing Foundation Model With A Affine Transformation Contrastive Learning }, author={ Wenhui Diao and Haichen Yu and Kaiyue Kang and Tong Ling and Di Liu and Yingchao Feng and Hanbo Bi and Libo Ren and Xuexue Li and Yongqiang Mao and Xian Sun }, journal={arXiv preprint arXiv:2409.13366}, year={ 2025 } }