jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images

11 December 2024
Andreas Koukounas
Georgios Mastrapas
Bo Wang
Mohammad Kalim Akram
Sedigheh Eslami
Michael Günther
Isabelle Mohr
Saba Sturua
Scott Martens
Nan Wang
Abstract

Contrastive Language-Image Pretraining (CLIP) has been widely used for cross-modal information retrieval and multimodal understanding tasks. However, CLIP models are mainly optimized for cross-modal vision-language tasks and underperform on text-only tasks. Moreover, these models are often trained on English datasets and therefore lack multilingual understanding. Additionally, previous CLIP-based models exhibit limited understanding of visually rich documents. In this work, we propose jina-clip-v2, a contrastive vision-language model trained on text pairs, triplets, and image-text pairs via a multi-task, multi-stage contrastive learning paradigm to support both text-only and cross-modal tasks. We employ a multilingual text encoder and expand the training dataset to include multilingual texts from 29 non-English languages, including Hindi, Chinese, German, French, and others, as well as images of visually rich documents. We evaluate the model's performance and show that jina-clip-v2 achieves notable improvements over state-of-the-art CLIP-based models in zero-shot text-only retrieval, semantic textual similarity, and cross-modal retrieval tasks in both English and multilingual settings. jina-clip-v2 also offers flexible embedding dimensionality, enabling users to select the granularity of the representations. jina-clip-v2 is publicly available at this https URL.
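
The multi-task, multi-stage training mentioned above builds on the standard CLIP objective. As a point of reference, the following is a minimal PyTorch sketch of the symmetric InfoNCE loss over a batch of aligned image-text pairs; all names are illustrative, and the paper's full recipe additionally incorporates text-pair and triplet losses, which this sketch omits.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_emb: torch.Tensor,
                          image_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Symmetric InfoNCE over a batch of aligned (text, image) pairs;
    # row i of each tensor is assumed to be a matching pair.
    text_emb = F.normalize(text_emb, dim=-1)    # cosine similarity via
    image_emb = F.normalize(image_emb, dim=-1)  # normalized dot products
    logits = text_emb @ image_emb.T / temperature  # (batch, batch)
    # Matching pairs sit on the diagonal; all other entries are negatives.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)    # text -> image
    loss_i2t = F.cross_entropy(logits.T, targets)  # image -> text
    return (loss_t2i + loss_i2t) / 2

# Toy check with random embeddings.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())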

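The flexible embedding dimensionality noted in the abstract means an embedding can be truncated to a leading prefix of its dimensions and re-normalized, trading accuracy for storage. The sketch below assumes the Hugging Face checkpoint jinaai/jina-clip-v2 and its encode_text/encode_image methods with a truncate_dim argument, following the convention of the model card; the exact interface and the file name are assumptions of this sketch, not stated on this page.

import numpy as np
from transformers import AutoModel

# Custom model code ships with the checkpoint, hence trust_remote_code.
model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

texts = ["A photograph of a mountain lake", "Ein Foto eines Bergsees"]
text_emb = model.encode_text(texts, truncate_dim=512)
# Hypothetical local image file; the model card also accepts URLs.
image_emb = model.encode_image(["mountain_lake.jpg"], truncate_dim=512)

# Manual equivalent: keep the leading dimensions, then L2-renormalize.
def truncate(emb: np.ndarray, dim: int) -> np.ndarray:
    cut = np.asarray(emb)[:, :dim]
    return cut / np.linalg.norm(cut, axis=-1, keepdims=True)

# Cross-modal cosine similarities at a coarser 256-dim granularity.
print(truncate(text_emb, 256) @ truncate(image_emb, 256).T)
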
@article{koukounas2025_2412.08802,
  title={jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images},
  author={Andreas Koukounas and Georgios Mastrapas and Sedigheh Eslami and Bo Wang and Mohammad Kalim Akram and Michael Günther and Isabelle Mohr and Saba Sturua and Nan Wang and Han Xiao},
  journal={arXiv preprint arXiv:2412.08802},
  year={2025}
}