Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition

27 February 2026

Mohammadreza Heidarianbaei

Mareike Dorozynski

Hubert Kanyamahanga

Max Mehltretter

Franz Rottensteiner

VLM


Main:9 Pages

3 Figures

Bibliography:5 Pages

4 Tables

Abstract

In this paper, we propose ReSeg-CLIP, a new training-free open-vocabulary semantic segmentation method for remote sensing (RS) data. Vision-language models such as CLIP perform poorly in semantic segmentation because of inappropriate interactions within their self-attention layers. To compensate for this, we introduce a hierarchical scheme that uses masks generated by SAM to constrain these interactions at multiple scales. We also present a model composition approach that averages the parameters of multiple RS-specific CLIP variants, using a new weighting scheme that evaluates representational quality with varying text prompts. Our method achieves state-of-the-art results across three RS benchmarks without additional training.
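The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the `segment_ids` encoding of SAM masks, and the `quality_scores` input are all hypothetical. The sketch restricts attention at a single scale only, whereas the paper describes a hierarchical, multi-scale scheme. It also takes the per-model quality scores as given, because the abstract does not specify how they are computed from the text prompts.

```python
import math


def masked_attention_weights(scores, segment_ids):
    """Row-wise softmax over raw attention scores, restricted by a mask.

    Assumption: segment_ids[j] is the id of the (e.g. SAM-generated) mask
    that token j falls into. Token i may only attend to tokens that share
    its segment id; all other attention weights are set to zero.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # A token always shares a segment with itself, so this is non-empty.
        allowed = [j for j in range(n) if segment_ids[j] == segment_ids[i]]
        row_max = max(scores[i][j] for j in allowed)  # for numerical stability
        exps = {j: math.exp(scores[i][j] - row_max) for j in allowed}
        norm = sum(exps.values())
        weights.append([exps[j] / norm if j in exps else 0.0 for j in range(n)])
    return weights


def average_parameters(models, quality_scores):
    """Weighted average of the parameters of several models.

    Assumption: each model is a dict mapping a parameter name to a flat list
    of floats, all models share the same names and shapes, and
    quality_scores holds one non-negative score per model.
    """
    if not models or len(models) != len(quality_scores):
        raise ValueError("need exactly one quality score per model")
    total = sum(quality_scores)
    if total <= 0:
        raise ValueError("quality scores must sum to a positive value")
    model_weights = [score / total for score in quality_scores]
    merged = {}
    for name, reference in models[0].items():
        merged[name] = [
            sum(w * model[name][k] for w, model in zip(model_weights, models))
            for k in range(len(reference))
        ]
    return merged
```

With segment ids `[0, 0, 1]`, for example, the third token attends only to itself, and the first two tokens share attention only between each other. Normalizing the scores to sum to one keeps the merged parameters on the same scale as those of the individual models.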
