Modeling Caption Diversity in Contrastive Vision-Language Pretraining

30 April 2024
Samuel Lavoie
Polina Kirichenko
Mark Ibrahim
Mahmoud Assran
Andrew Gordon Wilson
Aaron Courville
Nicolas Ballas
    CLIP, VLM

Papers citing "Modeling Caption Diversity in Contrastive Vision-Language Pretraining"

50 / 64 papers shown
Contrastive vision-language learning with paraphrasing and negation
K. Ngan
Saman Sadeghi Afgeh
Joe Townsend
Artur Garcez
VLM
120
0
0
20 Nov 2025
RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification
Meilong Xu
Di Fu
Jiaxing Zhang
Gong Yu
Jiayu Zheng
Xiaoling Hu
Dongdi Zhao
Feiyang Li
Chao Chen
Yong Cao
65
0
0
19 Nov 2025
Beyond the Seen: Bounded Distribution Estimation for Open-Vocabulary Learning
Xiaomeng Fan
Yuchuan Mao
Zhi Gao
Yuwei Wu
Jin Chen
Yunde Jia
144
1
0
06 Oct 2025
Synthetic Captions for Open-Vocabulary Zero-Shot Segmentation
Tim Lebailly
Vijay Veerabadran
Satwik Kottur
Karl Ridgeway
Michael L. Iuzzolino
VLM
71
0
0
15 Sep 2025
EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition
Hugo Thimonier
Antony Perzo
Renaud Seguier
96
1
0
19 Aug 2025
Phi-Ground Tech Report: Advancing Perception in GUI Grounding
Miaosen Zhang
Ziqiang Xu
Jialiang Zhu
Qi Dai
Kai Qiu
...
Chong Luo
Tianyi Chen
Justin Wagle
Tim Franklin
Baining Guo
LRM
160
8
0
31 Jul 2025
SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Computer Vision and Pattern Recognition (CVPR), 2025
Shaoan Xie
Lingjing Kong
Yujia Zheng
Yu Yao
Zeyu Tang
Eric Xing
Guangyi Chen
Kun Zhang
VLM
186
3
0
29 Jul 2025
Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions
Licai Sun
Xingxun Jiang
Haoyu Chen
Yante Li
Zheng Lian
B. Liu
Yuan Zong
Wenming Zheng
Jukka M. Leppänen
Guoying Zhao
CLIP, VLM
153
1
0
28 Jul 2025
Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting
Chen Huang
Skyler Seto
Hadi Pouransari
Mehrdad Farajtabar
Raviteja Vemulapalli
Fartash Faghri
Oncel Tuzel
B. Theobald
Josh Susskind
CLL
236
0
0
30 May 2025
Fine-grained Contrastive Learning for ECG-Report Alignment with Waveform Enhancement
Haitao Li
Che Liu
Zhengyao Ding
Ziyi Liu
Wenqi Shao
Zhengxing Huang
172
1
0
17 May 2025
On the Value of Cross-Modal Misalignment in Multimodal Representation Learning
Yichao Cai
Yuhang Liu
Erdun Gao
Tianjiao Jiang
Zhen Zhang
Anton van den Hengel
Javen Qinfeng Shi
528
0
0
14 Apr 2025
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
Cheng-Yu Hsieh
Pavan Kumar Anasosalu Vasu
Fartash Faghri
Raviteja Vemulapalli
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Hadi Pouransari
VLM
861
0
0
11 Apr 2025
GOAL: Global-local Object Alignment Learning
Computer Vision and Pattern Recognition (CVPR), 2025
Hyungyu Choi
Young Kyun Jang
Chanho Eom
VLM
845
6
0
22 Mar 2025
Bayesian Test-Time Adaptation for Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Lihua Zhou
Mao Ye
Shuaifeng Li
Nianxin Li
Xiatian Zhu
Lei Deng
Hongbin Liu
Zhen Lei
BDL, VLM, TTA
455
9
0
12 Mar 2025
MV-CLAM: Multi-View Molecular Interpretation with Cross-Modal Projection via Language Model
Sumin Ha
Jun Hyeong Kim
Yinhua Piao
Sun Kim
299
2
0
23 Feb 2025
Demystifying CLIP Data
International Conference on Learning Representations (ICLR), 2023
Hu Xu
Saining Xie
Xiaoqing Ellen Tan
Po-Yao (Bernie) Huang
Russell Howes
Vasu Sharma
Shang-Wen Li
Gargi Ghosh
Luke Zettlemoyer
Christoph Feichtenhofer
VLM, CLIP
507
193
0
31 Dec 2024
HyperCLIP: Adapting Vision-Language models with Hypernetworks
Victor Akinwande
Mohammad Sadegh Norouzzadeh
Devin Willmott
Anna Bair
Madan Ravi Ganesh
J. Zico Kolter
CLIP, VLM
265
2
0
21 Dec 2024
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
Computer Vision and Pattern Recognition (CVPR), 2024
Cijo Jose
Théo Moutakanni
Dahyun Kang
Federico Baldassarre
Timothée Darcet
...
Maxime Oquab
Oriane Siméoni
Huy V. Vo
Patrick Labatut
Piotr Bojanowski
CLIP, VLM
280
34
0
20 Dec 2024
FLAIR: VLM with Fine-grained Language-informed Image Representations
Computer Vision and Pattern Recognition (CVPR), 2024
Rui Xiao
Sanghwan Kim
Mariana-Iuliana Georgescu
Zeynep Akata
Stephan Alaniz
VLM, CLIP
280
19
0
04 Dec 2024
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
Computer Vision and Pattern Recognition (CVPR), 2024
Sanghwan Kim
Rui Xiao
Mariana-Iuliana Georgescu
Stephan Alaniz
Zeynep Akata
VLM
613
7
0
02 Dec 2024
Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP
Neural Information Processing Systems (NeurIPS), 2024
Chen Huang
Skyler Seto
Samira Abnar
David Grangier
Navdeep Jaitly
J. Susskind
VLM
199
4
0
31 Oct 2024
EchoPrime: A Multi-Video View-Informed Vision-Language Model for Comprehensive Echocardiography Interpretation
Milos Vukadinovic
Xiu Tang
N. Yuan
Paul Cheng
Debiao Li
Susan Cheng
Bryan He
David Ouyang
126
25
0
13 Oct 2024
DOTA: Distributional Test-Time Adaptation of Vision-Language Models
Zongbo Han
Jialong Yang
Guangyu Wang
Junfan Li
Qianli Xu
Mike Zheng Shou
Changqing Zhang
TTA, VLM
332
10
0
28 Sep 2024
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation
Computer Vision and Pattern Recognition (CVPR), 2024
Wei Chen
Lin Li
Yongqi Yang
Bin Wen
Fan Yang
Tingting Gao
Yu Wu
Long Chen
VLM, VGen
230
12
0
15 Jun 2024
Revisiting Feature Prediction for Learning Visual Representations from Video
Adrien Bardes
Q. Garrido
Jean Ponce
Xinlei Chen
Michael G. Rabbat
Yann LeCun
Mahmoud Assran
Nicolas Ballas
MDE, VLM
284
158
0
15 Feb 2024
Improved Baselines with Visual Instruction Tuning
Computer Vision and Pattern Recognition (CVPR), 2023
Haotian Liu
Chunyuan Li
Yuheng Li
Yong Jae Lee
VLM, MLLM
540
3,997
0
05 Oct 2023
Data Filtering Networks
International Conference on Learning Representations (ICLR), 2023
Alex Fang
Albin Madappally Jose
Amit Jain
Ludwig Schmidt
Alexander Toshev
Vaishaal Shankar
CLIP
351
207
0
29 Sep 2023
CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy
Xianhang Li
Zeyu Wang
Cihang Xie
CLIP, VLM
243
22
0
27 Jun 2023
Improved baselines for vision-language pre-training
Enrico Fini
Pietro Astolfi
Adriana Romero Soriano
Jakob Verbeek
M. Drozdzal
SSL, CLIP, VLM
303
26
0
15 May 2023
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab
Timothée Darcet
Théo Moutakanni
Huy Q. Vo
Marc Szafraniec
...
Edouard Grave
Julien Mairal
Patrick Labatut
Armand Joulin
Piotr Bojanowski
VLM, CLIP, SSL
1.0K
5,679
0
14 Apr 2023
Sigmoid Loss for Language Image Pre-Training
IEEE International Conference on Computer Vision (ICCV), 2023
Xiaohua Zhai
Basil Mustafa
Alexander Kolesnikov
Lucas Beyer
CLIP, VLM
1.2K
2,087
0
27 Mar 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
International Conference on Machine Learning (ICML), 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM, MLLM
1.1K
6,402
0
30 Jan 2023
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
Computer Vision and Pattern Recognition (CVPR), 2023
Mahmoud Assran
Quentin Duval
Ishan Misra
Piotr Bojanowski
Pascal Vincent
Michael G. Rabbat
Yann LeCun
Nicolas Ballas
SSL, AI4TS, MDE
361
538
0
19 Jan 2023
Reproducible scaling laws for contrastive language-image learning
Computer Vision and Pattern Recognition (CVPR), 2022
Mehdi Cherti
Romain Beaumont
Ross Wightman
Mitchell Wortsman
Gabriel Ilharco
Cade Gordon
Christoph Schuhmann
Ludwig Schmidt
J. Jitsev
VLM, CLIP
365
1,114
0
14 Dec 2022
Scaling Language-Image Pre-training via Masking
Computer Vision and Pattern Recognition (CVPR), 2022
Yanghao Li
Haoqi Fan
Ronghang Hu
Christoph Feichtenhofer
Kaiming He
CLIP, VLM
306
380
0
01 Dec 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Neural Information Processing Systems (NeurIPS), 2022
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM, ObjD
218
149
0
15 Jun 2022
Masked Siamese Networks for Label-Efficient Learning
European Conference on Computer Vision (ECCV), 2022
Mahmoud Assran
Mathilde Caron
Ishan Misra
Piotr Bojanowski
Florian Bordes
Pascal Vincent
Armand Joulin
Michael G. Rabbat
Nicolas Ballas
SSL
274
373
0
14 Apr 2022
Conditional Prompt Learning for Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2022
Kaiyang Zhou
Jingkang Yang
Chen Change Loy
Ziwei Liu
VLM, CLIP, VPVLM
477
1,812
0
10 Mar 2022
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
International Conference on Machine Learning (ICML), 2022
Alexei Baevski
Wei-Ning Hsu
Qiantong Xu
Arun Babu
Jiatao Gu
Michael Auli
SSL, VLM, ViT
416
1,014
0
07 Feb 2022
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
International Conference on Machine Learning (ICML), 2022
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM, ObjD
426
992
0
07 Feb 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
International Conference on Machine Learning (ICML), 2022
Junnan Li
Dongxu Li
Caiming Xiong
Guosheng Lin
MLLM, BDL, VLM, CLIP
1.2K
5,585
0
28 Jan 2022
SLIP: Self-supervision meets Language-Image Pre-training
European Conference on Computer Vision (ECCV), 2021
Norman Mu
Alexander Kirillov
David Wagner
Saining Xie
VLM, CLIP
328
562
0
23 Dec 2021
Understanding Dimensional Collapse in Contrastive Self-supervised Learning
Li Jing
Pascal Vincent
Yann LeCun
Yuandong Tian
SSL
269
422
0
18 Oct 2021
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
International Conference on Learning Representations (ICLR), 2021
Zirui Wang
Jiahui Yu
Adams Wei Yu
Zihang Dai
Yulia Tsvetkov
Yuan Cao
VLM, MLLM
591
901
0
24 Aug 2021
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Neural Information Processing Systems (NeurIPS), 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
FaML
741
2,411
0
16 Jul 2021
On Feature Decorrelation in Self-Supervised Learning
IEEE International Conference on Computer Vision (ICCV), 2021
Tianyu Hua
Wenxiao Wang
Zihui Xue
Sucheng Ren
Yue Wang
Hang Zhao
SSL, OOD
421
210
0
02 May 2021
Learning Transferable Visual Models From Natural Language Supervision
International Conference on Machine Learning (ICML), 2021
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP, VLM
2.0K
39,913
0
26 Feb 2021
Zero-Shot Text-to-Image Generation
International Conference on Machine Learning (ICML), 2021
Aditya A. Ramesh
Mikhail Pavlov
Gabriel Goh
Scott Gray
Chelsea Voss
Alec Radford
Mark Chen
Ilya Sutskever
VLM
680
5,871
0
24 Feb 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
International Conference on Machine Learning (ICML), 2021
Chao Jia
Yinfei Yang
Ye Xia
Yi-Ting Chen
Zarana Parekh
Hieu H. Pham
Quoc V. Le
Yun-hsuan Sung
Zhen Li
Tom Duerig
VLM, CLIP
1.3K
4,768
0
11 Feb 2021
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
683
405
0
31 Dec 2020