2405.00740
Cited By
Modeling Caption Diversity in Contrastive Vision-Language Pretraining
30 April 2024
Samuel Lavoie
Polina Kirichenko
Mark Ibrahim
Mahmoud Assran
Andrew Gordon Wilson
Aaron Courville
Nicolas Ballas
CLIP
VLM
Papers citing "Modeling Caption Diversity in Contrastive Vision-Language Pretraining"
50 / 64 papers shown
Title
Contrastive vision-language learning with paraphrasing and negation
K. Ngan
Saman Sadeghi Afgeh
Joe Townsend
Artur Garcez
VLM
120
0
0
20 Nov 2025
RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification
Meilong Xu
Di Fu
Jiaxing Zhang
Gong Yu
Jiayu Zheng
Xiaoling Hu
Dongdi Zhao
Feiyang Li
Chao Chen
Yong Cao
65
0
0
19 Nov 2025
Beyond the Seen: Bounded Distribution Estimation for Open-Vocabulary Learning
Xiaomeng Fan
Yuchuan Mao
Zhi Gao
Yuwei Wu
Jin Chen
Yunde Jia
144
1
0
06 Oct 2025
Synthetic Captions for Open-Vocabulary Zero-Shot Segmentation
Tim Lebailly
Vijay Veerabadran
Satwik Kottur
Karl Ridgeway
Michael L. Iuzzolino
VLM
71
0
0
15 Sep 2025
EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition
Hugo Thimonier
Antony Perzo
Renaud Seguier
96
1
0
19 Aug 2025
Phi-Ground Tech Report: Advancing Perception in GUI Grounding
Miaosen Zhang
Ziqiang Xu
Jialiang Zhu
Qi Dai
Kai Qiu
...
Chong Luo
Tianyi Chen
Justin Wagle
Tim Franklin
Baining Guo
LRM
160
8
0
31 Jul 2025
SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Computer Vision and Pattern Recognition (CVPR), 2025
Shaoan Xie
Lingjing Kong
Yujia Zheng
Yu Yao
Zeyu Tang
Eric Xing
Guangyi Chen
Kun Zhang
VLM
186
3
0
29 Jul 2025
Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions
Licai Sun
Xingxun Jiang
Haoyu Chen
Yante Li
Zheng Lian
B. Liu
Yuan Zong
Wenming Zheng
Jukka M. Leppänen
Guoying Zhao
CLIP
VLM
153
1
0
28 Jul 2025
Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting
Chen Huang
Skyler Seto
Hadi Pouransari
Mehrdad Farajtabar
Raviteja Vemulapalli
Fartash Faghri
Oncel Tuzel
B. Theobald
Josh Susskind
CLL
236
0
0
30 May 2025
Fine-grained Contrastive Learning for ECG-Report Alignment with Waveform Enhancement
Haitao Li
Che Liu
Zhengyao Ding
Ziyi Liu
Wenqi Shao
Zhengxing Huang
172
1
0
17 May 2025
On the Value of Cross-Modal Misalignment in Multimodal Representation Learning
Yichao Cai
Yuhang Liu
Erdun Gao
Tianjiao Jiang
Zhen Zhang
Anton van den Hengel
Javen Qinfeng Shi
528
0
0
14 Apr 2025
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
Cheng-Yu Hsieh
Pavan Kumar Anasosalu Vasu
Fartash Faghri
Raviteja Vemulapalli
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Hadi Pouransari
VLM
861
0
0
11 Apr 2025
GOAL: Global-local Object Alignment Learning
Computer Vision and Pattern Recognition (CVPR), 2025
Hyungyu Choi
Young Kyun Jang
Chanho Eom
VLM
845
6
0
22 Mar 2025
Bayesian Test-Time Adaptation for Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Lihua Zhou
Mao Ye
Shuaifeng Li
Nianxin Li
Xiatian Zhu
Lei Deng
Hongbin Liu
Zhen Lei
BDL
VLM
TTA
455
9
0
12 Mar 2025
MV-CLAM: Multi-View Molecular Interpretation with Cross-Modal Projection via Language Model
Sumin Ha
Jun Hyeong Kim
Yinhua Piao
Sun Kim
299
2
0
23 Feb 2025
Demystifying CLIP Data
International Conference on Learning Representations (ICLR), 2023
Hu Xu
Saining Xie
Xiaoqing Ellen Tan
Po-Yao (Bernie) Huang
Russell Howes
Vasu Sharma
Shang-Wen Li
Gargi Ghosh
Luke Zettlemoyer
Christoph Feichtenhofer
VLM
CLIP
507
193
0
31 Dec 2024
HyperCLIP: Adapting Vision-Language models with Hypernetworks
Victor Akinwande
Mohammad Sadegh Norouzzadeh
Devin Willmott
Anna Bair
Madan Ravi Ganesh
J. Zico Kolter
CLIP
VLM
265
2
0
21 Dec 2024
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
Computer Vision and Pattern Recognition (CVPR), 2024
Cijo Jose
Théo Moutakanni
Dahyun Kang
Federico Baldassarre
Timothée Darcet
...
Maxime Oquab
Oriane Siméoni
Huy V. Vo
Patrick Labatut
Piotr Bojanowski
CLIP
VLM
280
34
0
20 Dec 2024
FLAIR: VLM with Fine-grained Language-informed Image Representations
Computer Vision and Pattern Recognition (CVPR), 2024
Rui Xiao
Sanghwan Kim
Mariana-Iuliana Georgescu
Zeynep Akata
Stephan Alaniz
VLM
CLIP
280
19
0
04 Dec 2024
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
Computer Vision and Pattern Recognition (CVPR), 2024
Sanghwan Kim
Rui Xiao
Mariana-Iuliana Georgescu
Stephan Alaniz
Zeynep Akata
VLM
613
7
0
02 Dec 2024
Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP
Neural Information Processing Systems (NeurIPS), 2024
Chen Huang
Skyler Seto
Samira Abnar
David Grangier
Navdeep Jaitly
J. Susskind
VLM
199
4
0
31 Oct 2024
EchoPrime: A Multi-Video View-Informed Vision-Language Model for Comprehensive Echocardiography Interpretation
Milos Vukadinovic
Xiu Tang
N. Yuan
Paul Cheng
Debiao Li
Susan Cheng
Bryan He
David Ouyang
126
25
0
13 Oct 2024
DOTA: Distributional Test-Time Adaptation of Vision-Language Models
Zongbo Han
Jialong Yang
Guangyu Wang
Junfan Li
Qianli Xu
Mike Zheng Shou
Changqing Zhang
TTA
VLM
332
10
0
28 Sep 2024
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation
Computer Vision and Pattern Recognition (CVPR), 2024
Wei Chen
Lin Li
Yongqi Yang
Bin Wen
Fan Yang
Tingting Gao
Yu Wu
Long Chen
VLM
VGen
230
12
0
15 Jun 2024
Revisiting Feature Prediction for Learning Visual Representations from Video
Adrien Bardes
Q. Garrido
Jean Ponce
Xinlei Chen
Michael G. Rabbat
Yann LeCun
Mahmoud Assran
Nicolas Ballas
MDE
VLM
284
158
0
15 Feb 2024
Improved Baselines with Visual Instruction Tuning
Computer Vision and Pattern Recognition (CVPR), 2023
Haotian Liu
Chunyuan Li
Yuheng Li
Yong Jae Lee
VLM
MLLM
540
3,997
0
05 Oct 2023
Data Filtering Networks
International Conference on Learning Representations (ICLR), 2023
Alex Fang
Albin Madappally Jose
Amit Jain
Ludwig Schmidt
Alexander Toshev
Vaishaal Shankar
CLIP
351
207
0
29 Sep 2023
CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy
Xianhang Li
Zeyu Wang
Cihang Xie
CLIP
VLM
243
22
0
27 Jun 2023
Improved baselines for vision-language pre-training
Enrico Fini
Pietro Astolfi
Adriana Romero Soriano
Jakob Verbeek
M. Drozdzal
SSL
CLIP
VLM
303
26
0
15 May 2023
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab
Timothée Darcet
Théo Moutakanni
Huy V. Vo
Marc Szafraniec
...
Edouard Grave
Julien Mairal
Patrick Labatut
Armand Joulin
Piotr Bojanowski
VLM
CLIP
SSL
1.0K
5,679
0
14 Apr 2023
Sigmoid Loss for Language Image Pre-Training
IEEE International Conference on Computer Vision (ICCV), 2023
Xiaohua Zhai
Basil Mustafa
Alexander Kolesnikov
Lucas Beyer
CLIP
VLM
1.2K
2,087
0
27 Mar 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
International Conference on Machine Learning (ICML), 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
1.1K
6,402
0
30 Jan 2023
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
Computer Vision and Pattern Recognition (CVPR), 2023
Mahmoud Assran
Quentin Duval
Ishan Misra
Piotr Bojanowski
Pascal Vincent
Michael G. Rabbat
Yann LeCun
Nicolas Ballas
SSL
AI4TS
MDE
361
538
0
19 Jan 2023
Reproducible scaling laws for contrastive language-image learning
Computer Vision and Pattern Recognition (CVPR), 2022
Mehdi Cherti
Romain Beaumont
Ross Wightman
Mitchell Wortsman
Gabriel Ilharco
Cade Gordon
Christoph Schuhmann
Ludwig Schmidt
J. Jitsev
VLM
CLIP
365
1,114
0
14 Dec 2022
Scaling Language-Image Pre-training via Masking
Computer Vision and Pattern Recognition (CVPR), 2022
Yanghao Li
Haoqi Fan
Ronghang Hu
Christoph Feichtenhofer
Kaiming He
CLIP
VLM
306
380
0
01 Dec 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Neural Information Processing Systems (NeurIPS), 2022
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM
ObjD
218
149
0
15 Jun 2022
Masked Siamese Networks for Label-Efficient Learning
European Conference on Computer Vision (ECCV), 2022
Mahmoud Assran
Mathilde Caron
Ishan Misra
Piotr Bojanowski
Florian Bordes
Pascal Vincent
Armand Joulin
Michael G. Rabbat
Nicolas Ballas
SSL
274
373
0
14 Apr 2022
Conditional Prompt Learning for Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2022
Kaiyang Zhou
Jingkang Yang
Chen Change Loy
Ziwei Liu
VLM
CLIP
VPVLM
477
1,812
0
10 Mar 2022
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
International Conference on Machine Learning (ICML), 2022
Alexei Baevski
Wei-Ning Hsu
Qiantong Xu
Arun Babu
Jiatao Gu
Michael Auli
SSL
VLM
ViT
416
1,014
0
07 Feb 2022
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
International Conference on Machine Learning (ICML), 2022
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
426
992
0
07 Feb 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
International Conference on Machine Learning (ICML), 2022
Junnan Li
Dongxu Li
Caiming Xiong
Steven C. H. Hoi
MLLM
BDL
VLM
CLIP
1.2K
5,585
0
28 Jan 2022
SLIP: Self-supervision meets Language-Image Pre-training
European Conference on Computer Vision (ECCV), 2021
Norman Mu
Alexander Kirillov
David Wagner
Saining Xie
VLM
CLIP
328
562
0
23 Dec 2021
Understanding Dimensional Collapse in Contrastive Self-supervised Learning
Li Jing
Pascal Vincent
Yann LeCun
Yuandong Tian
SSL
269
422
0
18 Oct 2021
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
International Conference on Learning Representations (ICLR), 2021
Zirui Wang
Jiahui Yu
Adams Wei Yu
Zihang Dai
Yulia Tsvetkov
Yuan Cao
VLM
MLLM
591
901
0
24 Aug 2021
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Neural Information Processing Systems (NeurIPS), 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Steven C. H. Hoi
FaML
741
2,411
0
16 Jul 2021
On Feature Decorrelation in Self-Supervised Learning
IEEE International Conference on Computer Vision (ICCV), 2021
Tianyu Hua
Wenxiao Wang
Zihui Xue
Sucheng Ren
Yue Wang
Hang Zhao
SSL
OOD
421
210
0
02 May 2021
Learning Transferable Visual Models From Natural Language Supervision
International Conference on Machine Learning (ICML), 2021
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
2.0K
39,913
0
26 Feb 2021
Zero-Shot Text-to-Image Generation
International Conference on Machine Learning (ICML), 2021
Aditya A. Ramesh
Mikhail Pavlov
Gabriel Goh
Scott Gray
Chelsea Voss
Alec Radford
Mark Chen
Ilya Sutskever
VLM
680
5,871
0
24 Feb 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
International Conference on Machine Learning (ICML), 2021
Chao Jia
Yinfei Yang
Ye Xia
Yi-Ting Chen
Zarana Parekh
Hieu H. Pham
Quoc V. Le
Yun-hsuan Sung
Zhen Li
Tom Duerig
VLM
CLIP
1.3K
4,768
0
11 Feb 2021
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
683
405
0
31 Dec 2020