Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning

6 May 2025
François Role
Sébastien Meyer
Victor Amblard
    VLM
Abstract

Vision-language models (VLMs) embed texts and images in a shared representation space. However, these models have been shown to suffer from a modality gap: the embeddings of one modality are clearly separated from those of the other in the embedding space. While this misalignment is detrimental to downstream tasks such as multimodal retrieval, multimodal clustering, and zero-shot classification, no generic and practical methods have so far been proposed to precisely assess it, let alone reduce it. We therefore propose novel measures and effective techniques (spectral- and optimal transport-based methods) to achieve this goal. Extensive experiments conducted on several image-text datasets and models demonstrate their effectiveness and beneficial effects on downstream tasks. Our code is available at the URL provided in the paper's abstract.
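The abstract does not spell out the paper's spectral- and optimal transport-based techniques, so the sketch below is only a minimal illustration of what "quantifying and reducing the modality gap" can mean in practice. It uses a common baseline from the modality-gap literature (the Euclidean distance between the centroids of L2-normalized image and text embeddings) and a naive mean-centering correction; both function names and the correction step are assumptions for illustration, not the authors' method.

```python
import numpy as np

def modality_gap(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Baseline gap measure: Euclidean distance between the centroids
    of L2-normalized image and text embeddings (not the paper's measure)."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

def close_gap_by_centering(image_emb: np.ndarray, text_emb: np.ndarray):
    """Naive reduction: subtract each modality's mean so both embedding
    clouds share a common centroid. Illustrative only; the paper proposes
    spectral- and optimal transport-based corrections instead."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return img - img.mean(axis=0), txt - txt.mean(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic embeddings with an artificial offset between modalities.
    img = rng.normal(size=(100, 64)) + 2.0
    txt = rng.normal(size=(100, 64))
    print("gap before:", modality_gap(img, txt))
    img_c, txt_c = close_gap_by_centering(img, txt)
    # Centroids coincide after centering (no renormalization applied here).
    print("gap after :", float(np.linalg.norm(img_c.mean(0) - txt_c.mean(0))))
```

Mean-centering collapses the gap measure to zero by construction but can distort the geometry that downstream tasks rely on, which is presumably why the paper develops more principled spectral and optimal-transport corrections.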

@article{role2025_2505.03703,
  title={Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning},
  author={François Role and Sébastien Meyer and Victor Amblard},
  journal={arXiv preprint arXiv:2505.03703},
  year={2025}
}