2205.01917
Cited By
v1
v2 (latest)
CoCa: Contrastive Captioners are Image-Text Foundation Models
4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"
50 / 1,043 papers shown
Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
Xin Xiao
Bohong Wu
Jiacong Wang
Chunyuan Li
Xun Zhou
Haoyuan Guo
VLM
185
20
0
28 May 2024
Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model
Haogeng Liu
Quanzeng You
Xiaotian Han
Yongfei Liu
Huaibo Huang
Ran He
Hongxia Yang
131
4
0
28 May 2024
Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling
Cristian Rodriguez-Opazo
Ehsan Abbasnejad
Damien Teney
Edison Marrese-Taylor
Hamed Damirchi
Anton Van Den Hengel
VLM
348
1
0
27 May 2024
A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts
Mohammed Nowaz Rabbani Chowdhury
Meng Wang
Kaoutar El Maghraoui
Naigang Wang
Pin-Yu Chen
Christopher Carothers
MoE
409
10
0
26 May 2024
ECG Semantic Integrator (ESI): A Foundation ECG Model Pretrained with LLM-Enhanced Cardiological Text
Han Yu
Peikun Guo
Akane Sano
212
34
0
26 May 2024
DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution
Yuzhong Zhao
Feng Liu
Yue Liu
Mingxiang Liao
Chen Gong
QiXiang Ye
Fang Wan
ObjD
212
0
0
25 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
910
169
0
23 May 2024
No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models
Angeline Pouget
Lucas Beyer
Emanuele Bugliarello
Xiao Wang
Andreas Steiner
Xiaohua Zhai
Ibrahim Alabdulmohsin
VLM
273
13
0
22 May 2024
More Distinctively Black and Feminine Faces Lead to Increased Stereotyping in Vision-Language Models
Messi H.J. Lee
Jacob M. Montgomery
Calvin K. Lai
VLM
196
0
0
22 May 2024
OpenCarbonEval: A Unified Carbon Emission Estimation Framework in Large-Scale AI Models
Zhaojian Yu
Yinghao Wu
Zhuotao Deng
Yansong Tang
Jinqiang Cui
217
6
0
21 May 2024
Transcriptomics-guided Slide Representation Learning in Computational Pathology
Computer Vision and Pattern Recognition (CVPR), 2024
Guillaume Jaume
Lukas Oldenburg
Anurag J. Vaidya
Richard J. Chen
Drew F. K. Williamson
Thomas Peeters
Andrew H. Song
Faisal Mahmood
299
59
0
19 May 2024
A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision
Charles Raude
Prajwal K R
Liliane Momeni
Hannah Bull
Samuel Albanie
Andrew Zisserman
Gül Varol
SLR
326
8
0
16 May 2024
PRISM: A Multi-Modal Generative Foundation Model for Slide-Level Histopathology
Eugene Vorontsov
Adam Casson
Kristen Severson
Eric Zimmermann
Yi Kan Wang
...
Peter Hamilton
William A. Moye
Eugene Vorontsov
Siqi Liu
Thomas J. Fuchs
MedIm
311
66
0
16 May 2024
CLIP with Quality Captions: A Strong Pretraining for Vision Tasks
Pavan Kumar Anasosalu Vasu
Hadi Pouransari
Fartash Faghri
Oncel Tuzel
VLM
CLIP
266
9
0
14 May 2024
Efficient Vision-Language Pre-training by Cluster Masking
Computer Vision and Pattern Recognition (CVPR), 2024
Zihao Wei
Zixuan Pan
Andrew Owens
VLM
312
15
0
14 May 2024
All in One Framework for Multimodal Re-identification in the Wild
He Li
Mang Ye
Ming Zhang
Bo Du
291
28
0
08 May 2024
iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Lorenzo Agnolucci
Alberto Baldrati
Marco Bertini
377
22
0
05 May 2024
Understanding Retrieval-Augmented Task Adaptation for Vision-Language Models
International Conference on Machine Learning (ICML), 2024
Yifei Ming
Yixuan Li
VLM
294
9
0
02 May 2024
Hallucination of Multimodal Large Language Models: A Survey
Zechen Bai
Pichao Wang
Tianjun Xiao
Tong He
Zongbo Han
Zheng Zhang
Mike Zheng Shou
VLM
LRM
653
306
0
29 Apr 2024
Learning text-to-video retrieval from image captioning
Lucas Ventura
Cordelia Schmid
Gül Varol
3DV
272
8
0
26 Apr 2024
Embracing Diversity: Interpretable Zero-shot classification beyond one vector per class
Mazda Moayeri
Michael G. Rabbat
Mark Ibrahim
Diane Bouchacourt
VLM
226
5
0
25 Apr 2024
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
Olivia Wiles
Chuhan Zhang
Isabela Albuquerque
Ivana Kajić
Su Wang
...
Jordi Pont-Tuset
Aida Nematzadeh
Anant Nawalgaria
EGVM
996
33
0
25 Apr 2024
FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication
Eric Slyman
Stefan Lee
Scott D. Cohen
Kushal Kafle
VLM
181
9
0
24 Apr 2024
MoDE: CLIP Data Experts via Clustering
Jiawei Ma
Po-Yao Huang
Saining Xie
Shang-Wen Li
Luke Zettlemoyer
Shih-Fu Chang
Anuj Kumar
Hu Xu
MoE
CLIP
VLM
261
25
0
24 Apr 2024
SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision
Ankit Vani
Bac Nguyen
Samuel Lavoie
Ranjay Krishna
Aaron Courville
259
2
0
24 Apr 2024
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
Sachin Mehta
Maxwell Horton
Fartash Faghri
Mohammad Hossein Sekhavat
Mahyar Najibi
Mehrdad Farajtabar
Oncel Tuzel
Mohammad Rastegari
VLM
CLIP
187
9
0
24 Apr 2024
Reconstructing the Image Stitching Pipeline: Integrating Fusion and Rectangling into a Unified Inpainting Model
Ziqi Xie
Weidong Zhao
Xianhui Liu
Jian Zhao
Ning Jia
228
7
0
23 Apr 2024
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Hang Hua
Jing Shi
Kushal Kafle
Simon Jenni
Daoan Zhang
John Collomosse
Scott D. Cohen
Jiebo Luo
CoGe
VLM
205
14
0
23 Apr 2024
AutoAD III: The Prequel -- Back to the Pixels
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
312
33
0
22 Apr 2024
Image Generative Semantic Communication with Multi-Modal Similarity Estimation for Resource-Limited Networks
Eri Hosonuma
Taku Yamazaki
Takumi Miyoshi
Akihito Taya
Yuuki Nishiyama
K. Sezaki
DiffM
276
7
0
17 Apr 2024
Vocabulary-free Image Classification and Semantic Segmentation
Alessandro Conti
Enrico Fini
Goran Frehse
Paolo Rota
Yiming Wang
Elisa Ricci
VLM
221
7
0
16 Apr 2024
CNN-based explanation ensembling for dataset, representation and explanations evaluation
Weronika Hryniewska-Guzik
Luca Longo
P. Biecek
FAtt
206
2
0
16 Apr 2024
Knowledge-enhanced Visual-Language Pretraining for Computational Pathology
Xiao Zhou
Xiaoman Zhang
Chaoyi Wu
Ya Zhang
Weidi Xie
Yanfeng Wang
VLM
290
13
0
15 Apr 2024
The Devil is in the Few Shots: Iterative Visual Knowledge Completion for Few-shot Learning
Yaohui Li
Qifeng Zhou
Haoxing Chen
Jianbing Zhang
Xinyu Dai
Hao Zhou
VLM
276
1
0
15 Apr 2024
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning
Quang Minh Dinh
Minh Khoi Ho
Anh Quan Dang
Hung Phong Tran
248
19
0
14 Apr 2024
TransformerFAM: Feedback attention is working memory
Dongseong Hwang
Weiran Wang
Zhuoyuan Huo
K. Sim
P. M. Mengibar
420
17
0
14 Apr 2024
AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning
Yuwei Tang
Zhenyi Lin
Qilong Wang
Qinghua Hu
204
24
0
13 Apr 2024
ChimpVLM: Ethogram-Enhanced Chimpanzee Behaviour Recognition
Otto Brookes
Majid Mirmehdi
H. Kühl
T. Burghardt
157
5
0
13 Apr 2024
PM2: A New Prompting Multi-modal Model Paradigm for Few-shot Medical Image Classification
Zhenwei Wang
Qiule Sun
Bingbing Zhang
Pengfei Wang
Jianxin Zhang
Qiang Zhang
VLM
310
4
0
13 Apr 2024
COCONut: Modernizing COCO Segmentation
XueQing Deng
Qihang Yu
Peng Wang
Xiaohui Shen
Liang-Chieh Chen
206
22
0
12 Apr 2024
Improving Continuous Sign Language Recognition with Adapted Image Models
Lianyu Hu
Tongkai Shi
Liqing Gao
Zekang Liu
Wei Feng
VLM
232
9
0
12 Apr 2024
Generalized Contrastive Learning for Multi-Modal Retrieval and Ranking
Tianyu Zhu
M. Jung
Jesse Clark
433
4
0
12 Apr 2024
Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models
International Conference on Learning Representations (ICLR), 2024
Simon Schrodi
David T. Hoffmann
Max Argus
Volker Fischer
Thomas Brox
VLM
518
4
0
11 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models
European Conference on Computer Vision (ECCV), 2024
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM
VLM
308
58
0
10 Apr 2024
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models
David Kurzendörfer
Otniel-Bogdan Mercea
A. Sophia Koepke
Zeynep Akata
VLM
CLIP
191
3
0
09 Apr 2024
Test-Time Zero-Shot Temporal Action Localization
Benedetta Liberatori
Alessandro Conti
Paolo Rota
Yiming Wang
Elisa Ricci
303
11
0
08 Apr 2024
Hyperbolic Learning with Synthetic Captions for Open-World Detection
Fanjie Kong
Yanbei Chen
Jiarui Cai
Davide Modolo
VLM
ObjD
220
14
0
07 Apr 2024
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Neural Information Processing Systems (NeurIPS), 2024
Dongzhi Jiang
Guanglu Song
Xiaoshi Wu
Renrui Zhang
Dazhong Shen
Zhuofan Zong
Yu Liu
Jiaming Song
VLM
459
54
0
04 Apr 2024
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
Neural Information Processing Systems (NeurIPS), 2024
Vishaal Udandarao
Christian Schroeder de Witt
Adhiraj Ghosh
Yash Sharma
Juil Sock
Adel Bibi
Samuel Albanie
Matthias Bethge
VLM
705
81
0
04 Apr 2024
Foundation Model for Advancing Healthcare: Challenges, Opportunities, and Future Directions
IEEE Reviews in Biomedical Engineering (RBME), 2024
Yuting He
Fuxiang Huang
Xinrui Jiang
Yuxiang Nie
Minghao Wang
Jiguang Wang
Hao Chen
LM&MA
AI4CE
365
95
0
04 Apr 2024