Continual Cross-Modal Generalization

1 April 2025

Abstract

Cross-modal generalization aims to learn a shared discrete representation space from multimodal pairs, enabling knowledge transfer across unannotated modalities. However, achieving a unified representation for all modality pairs requires extensive paired data, which is often impractical. Inspired by the availability of abundant bimodal data (e.g., in ImageBind), we explore a continual learning approach that incrementally maps new modalities into a shared discrete codebook via a mediator modality. We propose the Continual Mixture of Experts Adapter (CMoE-Adapter) to project diverse modalities into a unified space while preserving prior knowledge. To align semantics across stages, we introduce a Pseudo-Modality Replay (PMR) mechanism with a dynamically expanding codebook, enabling the model to adaptively incorporate new modalities using learned ones as guidance. Extensive experiments on image-text, audio-text, video-text, and speech-text show that our method achieves strong performance on various cross-modal generalization tasks. Code is provided in the supplementary material.

View on arXiv

@article{xia2025_2504.00561,
  title={ Continual Cross-Modal Generalization },
  author={ Yan Xia and Hai Huang and Minghui Fang and Zhou Zhao },
  journal={arXiv preprint arXiv:2504.00561},
  year={ 2025 }
}

Comments on this paper