Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning

Neural Information Processing Systems (NeurIPS), 2022

3 March 2022

Papers citing "Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning"

50 / 368 papers shown

SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP

197

10 Apr 2026

One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework

240

30 Mar 2026

DashFusion: Dual-stream Alignment with Hierarchical Bottleneck Fusion for Multimodal Sentiment Analysis

IEEE Transactions on Neural Networks and Learning Systems (IEEE TNNLS), 2025

152

05 Dec 2025

Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding

204

04 Dec 2025

Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction

166

03 Dec 2025

Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models

299

03 Dec 2025

Minimal neuron ablation triggers catastrophic collapse in the language core of Large Vision-Language Models

Cen Lu

Yung-Chen Tang

Andrea Cavallaro

30 Nov 2025

Bridging the Modality Gap by Similarity Standardization with Pseudo-Positive Samples

Shuhei Yamashita

Daiki Shirafuji

Tatsuhiko Saito

116

27 Nov 2025

Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

Samuele Dell'Erba

Andrew D. Bagdanov

227

25 Nov 2025

Decoupling and Damping: Structurally-Regularized Gradient Matching for Multimodal Graph Condensation

158

25 Nov 2025

UISearch: Graph-Based Embeddings for Multimodal Enterprise UI Screenshots Retrieval

310

24 Nov 2025

A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback

201

21 Nov 2025

uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data

222

17 Nov 2025

FoCLIP: A Feature-Space Misalignment Framework for CLIP-Based Image Manipulation and Detection

133

10 Nov 2025

MoRA: Missing Modality Low-Rank Adaptation for Visual Recognition

209

09 Nov 2025

On the Brittleness of CLIP Text Encoders

Allie Tran

Luca Rossetto

293

06 Nov 2025

ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology

198

04 Nov 2025

A Retrospect to Multi-prompt Learning across Vision and Language

IEEE International Conference on Computer Vision (ICCV), 2023

480

31 Oct 2025

Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

186

31 Oct 2025

A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models

Shihab Aaqil Ahamed

Udaya S.K.P. Miriya Thanthrige

Ranga Rodrigo

Muhammad Haris Khan

VLM

272

30 Oct 2025

T-REGS: Minimum Spanning Tree Regularization for Self-Supervised Learning

257

27 Oct 2025

Data-Centric Lessons To Improve Speech-Language Pretraining

185

22 Oct 2025

Theoretical Refinement of CLIP by Utilizing Linear Structure of Optimal Similarity

152

17 Oct 2025

DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models

259

16 Oct 2025

QuASH: Using Natural-Language Heuristics to Query Visual-Language Robotic Maps

Matti Pekkanen

Francesco Verdoja

Ville Kyrki

165

16 Oct 2025

When Embedding Models Meet: Procrustes Bounds and Applications

Lucas Maystre

Alvaro Ortega Gonzalez

199

15 Oct 2025

Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models

135

14 Oct 2025

Lifting Manifolds to Mitigate Pseudo-Alignment in LLM4TS

Liangwei Nathan Zheng

170

14 Oct 2025

Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap

150

13 Oct 2025

Self-Supervised Representation Learning with ID-Content Modality Alignment for Sequential Recommendation

197

12 Oct 2025

DREAM: A Benchmark Study for Deepfake photoREalism AssessMent

244

11 Oct 2025

D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models

Jisu Han

Wonjun Hwang

VLM

224

10 Oct 2025

Partial Information Decomposition via Normalizing Flows in Latent Gaussian Distributions

241

06 Oct 2025

Mitigating Modal Imbalance in Multimodal Reasoning

186

02 Oct 2025

Generalized Contrastive Learning for Universal Multimodal Retrieval

240

30 Sep 2025

Semantic Compression via Multimodal Representation Learning

195

29 Sep 2025

Hierarchical Representation Matching for CLIP-based Class-Incremental Learning

209

26 Sep 2025

LABELING COPILOT: A Deep Research Agent for Automated Data Curation in Computer Vision

Debargha Ganguly

Sumit Kumar

Ishwar B Balappanawar

Weicong Chen

Shashank Kambhatla

Srinivasan Iyengar

Shivkumar Kalyanaraman

Ponnurangam Kumaraguru

Vipin Chaudhary

VLM

247

26 Sep 2025

Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models

385

24 Sep 2025

Single-Branch Network Architectures to Close the Modality Gap in Multimodal Recommendation

191

23 Sep 2025

A Modality-Aware Cooperative Co-Evolutionary Framework for Multimodal Graph Neural Architecture Search

133

23 Sep 2025

Global Minimizers of Sigmoid Contrastive Loss

231

23 Sep 2025

Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions

Georgios Tzimiropoulos

VLM

182

23 Sep 2025

Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning

267

22 Sep 2025

ADVEDM: Fine-grained Adversarial Attack against VLM-based Embodied Agents

278

20 Sep 2025

SitLLM: Large Language Models for Sitting Posture Health Understanding via Pressure Sensor Data

144

16 Sep 2025

Cross-Modal Retrieval with Cauchy-Schwarz Divergence

Jiahao Zhang

Wenzhe Yin

Shujian Yu

175

15 Sep 2025

Lost in Embeddings: Information Loss in Vision-Language Models

169

15 Sep 2025

Towards Understanding Visual Grounding in Visual Language Models

Georgios Pantazopoulos

Eda B. Özyiğit

ObjD

515

12 Sep 2025

Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models

IEEE Journal of Biomedical and Health Informatics (JBHI), 2025

165

11 Sep 2025