2205.01917
CoCa: Contrastive Captioners are Image-Text Foundation Models
4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
Papers citing
"CoCa: Contrastive Captioners are Image-Text Foundation Models"
50 / 1,042 papers shown
S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions
Neural Information Processing Systems (NeurIPS), 2023
Sangwoo Mo
Minkyu Kim
Kyungmin Lee
Jinwoo Shin
VLM
CLIP
350
38
0
23 May 2023
CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model
IEEE Transactions on Image Processing (IEEE TIP), 2023
Shuai Zhao
Xiaohan Wang
Linchao Zhu
Yezhou Yang
CLIP
VLM
369
45
0
23 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
IEEE Transactions on Multimedia (IEEE TMM), 2023
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
Qingbin Liu
Jiashi Feng
VLM
CLIP
293
23
0
22 May 2023
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design
Neural Information Processing Systems (NeurIPS), 2023
Ibrahim Alabdulmohsin
Xiaohua Zhai
Alexander Kolesnikov
Lucas Beyer
VLM
589
90
0
22 May 2023
Album Storytelling with Iterative Story-aware Captioning and Large Language Models
Munan Ning
Yujia Xie
Dongdong Chen
Zeyin Song
Lu Yuan
Yonghong Tian
QiXiang Ye
Liuliang Yuan
196
9
0
22 May 2023
Gloss-Free End-to-End Sign Language Translation
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Kezhou Lin
Xiaohan Wang
Linchao Zhu
Ke Sun
Bang Zhang
Yezhou Yang
SLR
232
36
0
22 May 2023
Towards Explainable In-the-Wild Video Quality Assessment: A Database and a Language-Prompted Approach
ACM Multimedia (ACM MM), 2023
Haoning Wu
Erli Zhang
Liang Liao
Chaofeng Chen
Jingwen Hou
Annan Wang
Wenxiu Sun
Qiong Yan
Weisi Lin
193
58
0
22 May 2023
i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data
Ziyi Yang
Mahmoud Khademi
Yichong Xu
Reid Pryzant
Yuwei Fang
...
Yu Shi
Lu Yuan
Takuya Yoshioka
Michael Zeng
Xuedong Huang
154
4
0
21 May 2023
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
ACM Multimedia (ACM MM), 2023
Zikang Liu
Sihan Chen
Longteng Guo
Handong Li
Xingjian He
Qingbin Liu
202
3
0
19 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM
MLLM
ObjD
588
154
0
18 May 2023
MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts
Asian Conference on Computer Vision (ACCV), 2023
Qiuhui Chen
Xinyue Hu
Zirui Wang
Yi Hong
LM&MA
MedIm
174
67
0
18 May 2023
What You See is What You Read? Improving Text-Image Alignment Evaluation
Neural Information Processing Systems (NeurIPS), 2023
Michal Yarom
Yonatan Bitton
Soravit Changpinyo
Roee Aharoni
Jonathan Herzig
Oran Lang
E. Ofek
Idan Szpektor
EGVM
568
116
0
17 May 2023
Improved baselines for vision-language pre-training
Enrico Fini
Pietro Astolfi
Adriana Romero Soriano
Jakob Verbeek
M. Drozdzal
SSL
CLIP
VLM
387
26
0
15 May 2023
OneCAD: One Classifier for All image Datasets using multimodal learning
S. Wadekar
Eugenio Culurciello
282
0
0
11 May 2023
Simple Token-Level Confidence Improves Caption Correctness
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Suzanne Petryk
Spencer Whitehead
Joseph E. Gonzalez
Trevor Darrell
Anna Rohrbach
Marcus Rohrbach
244
10
0
11 May 2023
An Inverse Scaling Law for CLIP Training
Neural Information Processing Systems (NeurIPS), 2023
Xianhang Li
Zeyu Wang
Cihang Xie
VLM
CLIP
311
77
0
11 May 2023
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers
Computer Vision and Pattern Recognition (CVPR), 2023
Dahun Kim
A. Angelova
Weicheng Kuo
ObjD
ViT
VLM
416
110
0
11 May 2023
Self-Chained Image-Language Model for Video Localization and Question Answering
Neural Information Processing Systems (NeurIPS), 2023
Shoubin Yu
Jaemin Cho
Prateek Yadav
Mohit Bansal
397
200
0
11 May 2023
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
Neural Information Processing Systems (NeurIPS), 2023
Hassan Akbari
Dan Kondratyuk
Huayu Chen
Rachel Hornung
Jian Shu
Hartwig Adam
VLM
MoE
291
23
0
10 May 2023
Visual Tuning
ACM Computing Surveys (ACM Comput. Surv.), 2023
Bruce X. B. Yu
Jianlong Chang
Haixin Wang
Lin Liu
Shijie Wang
...
Lingxi Xie
Haojie Li
Zhouchen Lin
Qi Tian
Chang Wen Chen
VLM
438
60
0
10 May 2023
ImageBind: One Embedding Space To Bind Them All
Computer Vision and Pattern Recognition (CVPR), 2023
Rohit Girdhar
Alaaeldin El-Nouby
Zhuang Liu
Mannat Singh
Kalyan Vasudev Alwala
Armand Joulin
Ishan Misra
VLM
553
1,305
0
09 May 2023
Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness
Liangliang Cao
Bowen Zhang
Chen Chen
Yinfei Yang
Xianzhi Du
Wen‐Cheng Zhang
Zhiyun Lu
Yantao Zheng
CLIP
VLM
183
17
0
08 May 2023
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
AAAI Conference on Artificial Intelligence (AAAI), 2023
Yufen Huang
Jiji Tang
Zhuo Chen
Rongsheng Zhang
Xinfeng Zhang
...
Zeng Zhao
Zhou Zhao
Tangjie Lv
Zhipeng Hu
Wen Zhang
VLM
308
49
0
06 May 2023
TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis
IEEE International Conference on Computer Vision (ICCV), 2023
Mathis Petrovich
Michael J. Black
Gül Varol
VGen
346
154
0
02 May 2023
Multimodal Neural Databases
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2023
Giovanni Trappolini
Andrea Santilli
Emanuele Rodolà
A. Halevy
Fabrizio Silvestri
246
11
0
02 May 2023
What Do Self-Supervised Vision Transformers Learn?
International Conference on Learning Representations (ICLR), 2023
Namuk Park
Wonjae Kim
Byeongho Heo
Taekyung Kim
Sangdoo Yun
SSL
300
103
1
01 May 2023
Adversarial Representation Learning for Robust Privacy Preservation in Audio
IEEE Open Journal of Signal Processing (IEEE Open J. Signal Process.), 2023
Shayan Gharib
Minh Tran
Diep Luong
Konstantinos Drossos
Maria Sandsten
AAML
218
7
0
29 Apr 2023
An Empirical Study of Multimodal Model Merging
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yi-Lin Sung
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Mohit Bansal
Lijuan Wang
MoMe
335
52
0
28 Apr 2023
Towards Robust Text-Prompted Semantic Criterion for In-the-Wild Video Quality Assessment
Haoning Wu
Liang Liao
Annan Wang
Chaofeng Chen
Jingwen Hou
Wenxiu Sun
Qiong Yan
Weisi Lin
215
15
0
28 Apr 2023
Retrieval-based Knowledge Augmented Vision Language Pre-training
ACM Multimedia (ACM MM), 2023
Jiahua Rao
Zifei Shan
Long Liu
Yao Zhou
Yuedong Yang
VLM
299
22
0
27 Apr 2023
RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models
Seulki Park
Daeho Um
Hajung Yoon
Sanghyuk Chun
Sangdoo Yun
Hawook Jeong
398
5
0
21 Apr 2023
DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training
Computer Vision and Pattern Recognition (CVPR), 2023
Yihao Chen
Xianbiao Qi
Jianan Wang
Lei Zhang
175
24
0
17 Apr 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
Jing Liu
VLM
396
152
0
17 Apr 2023
Permutation Equivariance of Transformers and Its Applications
Computer Vision and Pattern Recognition (CVPR), 2023
Hengyuan Xu
Liyao Xiang
Hang Ye
Dixi Yao
Pengzhi Chu
Baochun Li
326
25
0
16 Apr 2023
MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation
Chinese Conference on Pattern Recognition and Computer Vision (CPRCV), 2023
Jie Guo
Qimeng Wang
Yan Gao
Xiaolong Jiang
Xu Tang
Yao Hu
Baochang Zhang
VLM
166
14
0
14 Apr 2023
Efficient Multimodal Fusion via Interactive Prompting
Computer Vision and Pattern Recognition (CVPR), 2023
Yaowei Li
Ruijie Quan
Linchao Zhu
Yezhou Yang
159
62
0
13 Apr 2023
RECLIP: Resource-efficient CLIP by Training with Small Images
Runze Li
Dahun Kim
B. Bhanu
Weicheng Kuo
VLM
CLIP
264
17
0
12 Apr 2023
Gradient-Free Textual Inversion
ACM Multimedia (ACM MM), 2023
Zhengcong Fei
Mingyuan Fan
Junshi Huang
DiffM
260
38
0
12 Apr 2023
MoMo: A shared encoder Model for text, image and multi-Modal representations
Rakesh Chada
Zhao-Heng Zheng
P. Natarajan
ViT
116
5
0
11 Apr 2023
Improving Image Recognition by Retrieving from Web-Scale Image-Text Data
Computer Vision and Pattern Recognition (CVPR), 2023
Ahmet Iscen
Alireza Fathi
Cordelia Schmid
VLM
3DV
265
30
0
11 Apr 2023
Token Boosting for Robust Self-Supervised Visual Transformer Pre-training
Computer Vision and Pattern Recognition (CVPR), 2023
Tianjiao Li
Lin Geng Foo
Ping Hu
Xindi Shang
Hossein Rahmani
Zehuan Yuan
Jing Liu
300
7
0
09 Apr 2023
Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-commerce
Computer Vision and Pattern Recognition (CVPR), 2023
Yang Jin
Yongzhi Li
Zehuan Yuan
Yadong Mu
168
20
0
06 Apr 2023
VicTR: Video-conditioned Text Representations for Activity Recognition
Computer Vision and Pattern Recognition (CVPR), 2023
Kumara Kahatapitiya
Anurag Arnab
Arsha Nagrani
Michael S. Ryoo
347
36
0
05 Apr 2023
Adopting Two Supervisors for Efficient Use of Large-Scale Remote Deep Neural Networks
ACM Transactions on Software Engineering and Methodology (TOSEM), 2023
Michael Weiss
Paolo Tonella
AI4CE
191
1
0
05 Apr 2023
ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules
IEEE International Conference on Computer Vision (ICCV), 2023
Zhi-Qi Cheng
Qianwen Dai
Siyao Li
Yuxuan Zhou
Teruko Mitamura
Alexander G. Hauptmann
218
28
0
05 Apr 2023
Uncertainty estimation in Deep Learning for Panoptic segmentation
Michael J. Smith
F. Ferrie
OOD
UQCV
178
0
0
04 Apr 2023
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
Vladislav Lialin
Stephen Rawls
David M. Chan
Shalini Ghosh
Anna Rumshisky
Wael Hamza
VLM
AI4TS
267
8
0
04 Apr 2023
Black Box Few-Shot Adaptation for Vision-Language models
IEEE International Conference on Computer Vision (ICCV), 2023
Yassine Ouali
Adrian Bulat
Brais Martínez
Georgios Tzimiropoulos
VLM
247
45
0
04 Apr 2023
Exploring Vision-Language Models for Imbalanced Learning
International Journal of Computer Vision (IJCV), 2023
Yidong Wang
Zhuohao Yu
Yongfeng Zhang
Qiang Heng
Haoxing Chen
Wei Ye
Rui Xie
Xingxu Xie
Shi-Bo Zhang
VLM
308
52
0
04 Apr 2023
Vision-Language Models for Vision Tasks: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Jingyi Zhang
Jiaxing Huang
Sheng Jin
Shijian Lu
VLM
499
1,014
0
03 Apr 2023