
Beyond-Labels: Advancing Open-Vocabulary Segmentation With Vision-Language Models

International Conference on Advanced Computational Intelligence (ICACI), 2025

28 January 2025

Muhammad Atta ur Rahman


Main: 5 pages, 2 figures, 1 table; Bibliography: 1 page

Abstract

Open-vocabulary semantic segmentation aims to classify and outline objects in an image using arbitrary text labels, including labels unseen during training. Self-supervised models, when trained effectively, can address many visual and linguistic processing problems. This study investigates simple yet efficient methods for adapting pre-trained foundation models to open-vocabulary semantic segmentation. We propose "Beyond-Labels", a lightweight transformer-based fusion module that uses a small amount of image segmentation data to fuse frozen visual representations with language concepts. This strategy lets the model leverage the extensive knowledge of pre-trained models without significant retraining, making the approach data-efficient and scalable. Furthermore, we capture positional information in images using Fourier embeddings, which improves generalization and provides smooth, consistent spatial encoding. We perform thorough ablation studies to examine the main components of the proposed method. On the standard PASCAL-5i benchmark, the method performs better despite being trained on frozen vision and language representations.

Index Terms: Beyond-Labels, open-vocabulary semantic segmentation, Fourier embeddings, PASCAL-5i
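To illustrate what a Fourier positional embedding of image locations can look like, here is a minimal sketch. It is not the paper's implementation: the function name `fourier_position_embedding`, the `num_bands` parameter, the power-of-two frequency schedule, and the normalization of coordinates to [-1, 1] are all assumptions chosen for illustration.

```python
import math


def fourier_position_embedding(height, width, num_bands=4):
    """Return a height x width grid of Fourier features (nested lists).

    Each grid cell holds 4 * num_bands values: sin and cos of the
    normalized y and x coordinates at num_bands frequencies.
    All names and the frequency schedule are illustrative assumptions.
    """

    def normalized(index, size):
        # Map an integer index in [0, size - 1] to a coordinate in [-1, 1].
        return 0.0 if size == 1 else 2.0 * index / (size - 1) - 1.0

    # Assumed schedule: frequencies 1, 2, 4, ... (powers of two).
    freqs = [2.0 ** k for k in range(num_bands)]

    grid = []
    for row in range(height):
        y = normalized(row, height)
        grid_row = []
        for col in range(width):
            x = normalized(col, width)
            feats = []
            for coord in (y, x):
                # Sine features first, then cosine features, per axis.
                feats.extend(math.sin(math.pi * f * coord) for f in freqs)
                feats.extend(math.cos(math.pi * f * coord) for f in freqs)
            grid_row.append(feats)
        grid.append(grid_row)
    return grid
```

Because the features are smooth functions of continuous coordinates, nearby positions receive similar embeddings, and the same function can be evaluated at any resolution. In a fusion module, such features would typically be added to or concatenated with the frozen visual features before attention; how Beyond-Labels does this is not specified in the abstract.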
