Jina CLIP: Your CLIP Model Is Also Your Text Retriever

30 May 2024
Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao
Abstract

Contrastive Language-Image Pretraining (CLIP) is widely used to train models that align images and texts in a common embedding space by mapping them to fixed-size vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform specialized text models on text-only tasks. This creates inefficiencies for information retrieval systems that must maintain separate embeddings and models for text-only and multimodal tasks. We propose a novel multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model, achieving state-of-the-art performance on both text-image and text-text retrieval tasks.
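The abstract does not detail the training objective itself. As a rough illustration, the PyTorch sketch below shows one way a multi-task contrastive objective could combine a CLIP-style text-image term with a text-text retrieval term. The function names, the symmetric InfoNCE form, the temperature of 0.05, and the mixing weight `w` are all assumptions made for this sketch, not the paper's actual recipe.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.05):
    # Symmetric InfoNCE over a batch of paired embeddings.
    # a, b: [batch, dim] L2-normalized vectors; row i of `a` matches
    # row i of `b`, and all other rows act as in-batch negatives.
    logits = (a @ b.t()) / temperature                  # [batch, batch]
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def multi_task_contrastive_loss(caption_emb, image_emb,
                                query_emb, doc_emb, w=0.5):
    # Hypothetical joint objective: a CLIP-style text-image term plus a
    # text-text retrieval term, mixed with weight `w`. The actual pair
    # sources, schedule, and weighting in jina-clip-v1 may differ.
    loss_ti = info_nce(caption_emb, image_emb)  # align captions with images
    loss_tt = info_nce(query_emb, doc_emb)      # align queries with documents
    return w * loss_ti + (1 - w) * loss_tt

# Toy usage with random, normalized embeddings (batch=8, dim=768):
t = F.normalize(torch.randn(8, 768), dim=-1)
i = F.normalize(torch.randn(8, 768), dim=-1)
q = F.normalize(torch.randn(8, 768), dim=-1)
d = F.normalize(torch.randn(8, 768), dim=-1)
print(multi_task_contrastive_loss(t, i, q, d))
```

In-batch negatives keep the sketch self-contained; the actual jina-clip-v1 training may well use hard negatives, different temperatures, or a staged curriculum over pair types.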

View on arXiv: https://arxiv.org/abs/2405.20204