OVA-Det: Open Vocabulary Aerial Object Detection with Image-Text Collaboration

Aerial object detection plays a crucial role in numerous applications. However, most existing methods detect only predefined object categories, which limits their applicability in real-world open scenarios. In this paper, we extend aerial object detection to open scenarios through image-text collaboration and propose OVA-Det, a highly efficient open-vocabulary detector for aerial scenes. Specifically, we first introduce an image-to-text alignment loss to replace the conventional category regression loss, thereby eliminating category limitations. Next, we propose a lightweight text-guided strategy that enhances feature extraction in the encoder and enables queries to focus on class-relevant image features within the decoder, further improving detection accuracy without introducing significant additional cost. Extensive comparison experiments demonstrate that OVA-Det outperforms state-of-the-art methods by a large margin on all three widely used benchmark datasets. For instance, for zero-shot detection on DIOR, OVA-Det achieves 37.2 mAP and 79.8 Recall, which are 12.4 and 42.0 points higher than YOLO-World, respectively. In addition, the inference speed of OVA-Det reaches 36 FPS on an RTX 4090, meeting the real-time detection requirements of various applications. The code is available at this https URL.
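The image-to-text alignment idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so everything below is an assumption made for illustration: the function names (`alignment_logits`, `alignment_loss`), the use of cosine similarity with a temperature, and the softmax cross-entropy objective are hypothetical choices, not the authors' implementation. The sketch only shows how scoring query embeddings against text embeddings removes the need for a fixed-size classification head.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def alignment_logits(query_embeds, text_embeds, temperature=0.07):
    # Cosine similarity between every query and every category text embedding.
    # Shapes: (num_queries, dim) and (num_categories, dim) -> (num_queries, num_categories).
    # The temperature value is an assumed hyperparameter, not taken from the paper.
    q = l2_normalize(query_embeds)
    t = l2_normalize(text_embeds)
    return (q @ t.T) / temperature

def alignment_loss(logits, target_idx):
    # Softmax cross-entropy over the current text vocabulary (assumed loss form).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(target_idx)), target_idx].mean())

# Toy data: three category text embeddings and two object query embeddings.
text_embeds = np.eye(3, 4)
query_embeds = np.array([[0.1, 0.0, 0.9, 0.0],
                         [0.8, 0.1, 0.0, 0.2]])

logits = alignment_logits(query_embeds, text_embeds)
predicted = logits.argmax(axis=1)               # best-matching category per query
loss = alignment_loss(logits, np.array([2, 0]))
```

Under these assumptions, adding a category at inference time only means appending one more row to `text_embeds`; no output layer has to be resized or retrained, which is the sense in which an alignment objective avoids category limitations.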
@article{wei2025_2408.12246,
  title={OVA-Det: Open Vocabulary Aerial Object Detection with Image-Text Collaboration},
  author={Guoting Wei and Xia Yuan and Yu Liu and Zhenhao Shang and Xizhe Xue and Peng Wang and Kelu Yao and Chunxia Zhao and Haokui Zhang and Rong Xiao},
  journal={arXiv preprint arXiv:2408.12246},
  year={2025}
}