2205.01917
Cited By
v1
v2 (latest)
CoCa: Contrastive Captioners are Image-Text Foundation Models
4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"
41 / 1,041 papers shown
Title
Patching open-vocabulary models by interpolating weights
Neural Information Processing Systems (NeurIPS), 2022
Gabriel Ilharco
Mitchell Wortsman
S. Gadre
Shuran Song
Hannaneh Hajishirzi
Simon Kornblith
Ali Farhadi
Ludwig Schmidt
VLM
KELM
323
201
0
10 Aug 2022
Self-supervised Multi-modal Training from Uncurated Image and Reports Enables Zero-shot Oversight Artificial Intelligence in Radiology
Sangjoon Park
Eunha Lee
Kyung Sook Shin
Jeonghyeon Lee
Jong Chul Ye
141
2
0
10 Aug 2022
Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model
IEEE Transactions on Geoscience and Remote Sensing (IEEE TGRS), 2022
Di Wang
Qiming Zhang
Yufei Xu
Jing Zhang
Bo Du
Dacheng Tao
Guang Dai
261
316
0
08 Aug 2022
Prompt Tuning for Generative Multimodal Pretrained Models
Han Yang
Junyang Lin
An Yang
Peng Wang
Chang Zhou
Hongxia Yang
VLM
LRM
VPVLM
176
37
0
04 Aug 2022
Masked Vision and Language Modeling for Multi-modal Representation Learning
International Conference on Learning Representations (ICLR), 2022
Gukyeong Kwon
Zhaowei Cai
Avinash Ravichandran
Erhan Bas
Rahul Bhotika
Stefano Soatto
199
84
0
03 Aug 2022
Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models
Rui Qian
Yeqing Li
Zheng Xu
Ming-Hsuan Yang
Serge Belongie
Huayu Chen
VLM
160
25
0
15 Jul 2022
Convolutional Bypasses Are Better Vision Transformer Adapters
European Conference on Artificial Intelligence (ECAI), 2022
Shibo Jie
Zhi-Hong Deng
VPVLM
229
156
0
14 Jul 2022
Distance Learner: Incorporating Manifold Prior to Model Training
Aditya Chetan
Nipun Kwatra
65
1
0
14 Jul 2022
Revisiting Classifier: Transferring Vision-Language Models for Video Recognition
AAAI Conference on Artificial Intelligence (AAAI), 2022
Wenhao Wu
Zhun Sun
Wanli Ouyang
VLM
346
124
0
04 Jul 2022
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning
Neural Information Processing Systems (NeurIPS), 2022
Junting Pan
Ziyi Lin
Xiatian Zhu
Jing Shao
Jiaming Song
320
259
0
27 Jun 2022
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Jiahui Yu
Yuanzhong Xu
Jing Yu Koh
Thang Luong
Gunjan Baid
...
Zarana Parekh
Xin Li
Han Zhang
Jason Baldridge
Yonghui Wu
EGVM
561
1,349
0
22 Jun 2022
REVECA -- Rich Encoder-decoder framework for Video Event CAptioner
Jaehyuk Heo
YongGi Jeong
Sunwoo Kim
Jaehee Kim
Pilsung Kang
98
0
0
18 Jun 2022
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
International Conference on Learning Representations (ICLR), 2022
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD
VLM
MLLM
385
472
0
17 Jun 2022
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning
AAAI Conference on Artificial Intelligence (AAAI), 2022
Xiao Xu
Chenfei Wu
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
244
90
0
17 Jun 2022
MixGen: A New Multi-Modal Data Augmentation
Xiaoshuai Hao
Yi Zhu
Srikar Appalaraju
Aston Zhang
Wanqian Zhang
Boyang Li
Mu Li
VLM
361
121
0
16 Jun 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Neural Information Processing Systems (NeurIPS), 2022
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM
ObjD
226
150
0
15 Jun 2022
Multimodal Learning with Transformers: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Peng Xu
Xiatian Zhu
David Clifton
ViT
475
819
0
13 Jun 2022
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
Neural Information Processing Systems (NeurIPS), 2022
Jinguo Zhu
Xizhou Zhu
Wenhai Wang
Xiaohua Wang
Jiaming Song
Xiaogang Wang
Jifeng Dai
MoMe
MoE
261
84
0
09 Jun 2022
Neural Collapse: A Review on Modelling Principles and Generalization
Vignesh Kothapalli
367
103
0
08 Jun 2022
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Neural Information Processing Systems (NeurIPS), 2022
Basil Mustafa
C. Riquelme
J. Puigcerver
Rodolphe Jenatton
N. Houlsby
VLM
MoE
330
270
0
06 Jun 2022
Delving into the Openness of CLIP
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Shuhuai Ren
Lei Li
Xuancheng Ren
Guangxiang Zhao
Xu Sun
VLM
220
15
0
04 Jun 2022
Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning
Neural Information Processing Systems (NeurIPS), 2022
Yujia Xie
Luowei Zhou
Xiyang Dai
Lu Yuan
Nguyen Bach
Ce Liu
Michael Zeng
VLM
MLLM
156
30
0
03 Jun 2022
VL-BEiT: Generative Vision-Language Pretraining
Hangbo Bao
Wenhui Wang
Li Dong
Furu Wei
VLM
158
48
0
02 Jun 2022
Prefix Conditioning Unifies Language and Label Supervision
Computer Vision and Pattern Recognition (CVPR), 2022
Kuniaki Saito
Kihyuk Sohn
Xinming Zhang
Chun-Liang Li
Chen-Yu Lee
Kate Saenko
Tomas Pfister
VLM
CLIP
159
18
0
02 Jun 2022
Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Yan Zeng
Wangchunshu Zhou
Ao Luo
Ziming Cheng
Xinsong Zhang
VLM
281
37
0
01 Jun 2022
MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining
Pengyuan Lyu
Chengquan Zhang
Shanshan Liu
Meina Qiao
Yangliu Xu
Liang Wu
Kun Yao
Junyu Han
Errui Ding
Jingdong Wang
489
46
0
01 Jun 2022
Multimodal Masked Autoencoders Learn Transferable Representations
Xinyang Geng
Hao Liu
Lisa Lee
Dale Schuurmans
Sergey Levine
Pieter Abbeel
307
132
0
27 May 2022
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
582
698
0
27 May 2022
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Neural Information Processing Systems (NeurIPS), 2022
Chitwan Saharia
William Chan
Saurabh Saxena
Lala Li
Jay Whang
...
Raphael Gontijo-Lopes
Tim Salimans
Jonathan Ho
David J Fleet
Mohammad Norouzi
VLM
1.1K
7,395
0
23 May 2022
Deep transfer learning for image classification: a survey
J. Plested
Musa Phiri
Tom Gedeon
OOD
190
46
0
20 May 2022
Training Vision-Language Transformers from Captions
Liangke Gui
Yingshan Chang
Qiuyuan Huang
Subhojit Som
Alexander G. Hauptmann
Jianfeng Gao
Yonatan Bisk
VLM
ViT
366
11
0
19 May 2022
When does dough become a bagel? Analyzing the remaining mistakes on ImageNet
Neural Information Processing Systems (NeurIPS), 2022
Vijay Vasudevan
Benjamin Caine
Raphael Gontijo-Lopes
Sara Fridovich-Keil
Rebecca Roelofs
VLM
UQCV
181
69
0
09 May 2022
Unlocking High-Accuracy Differentially Private Image Classification through Scale
Soham De
Leonard Berrada
Jamie Hayes
Samuel L. Smith
Borja Balle
335
261
0
28 Apr 2022
CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks
International Conference on Learning Representations (ICLR), 2022
Tuomas P. Oikarinen
Tsui-Wei Weng
VLM
354
122
1
23 Apr 2022
Single-Stream Multi-Level Alignment for Vision-Language Pretraining
European Conference on Computer Vision (ECCV), 2022
Zaid Khan
B. Vijaykumar
Xiang Yu
S. Schulter
Manmohan Chandraker
Y. Fu
CLIP
VLM
272
21
0
27 Mar 2022
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
International Conference on Machine Learning (ICML), 2022
Mitchell Wortsman
Gabriel Ilharco
S. Gadre
Rebecca Roelofs
Raphael Gontijo-Lopes
...
Hongseok Namkoong
Ali Farhadi
Y. Carmon
Simon Kornblith
Ludwig Schmidt
MoMe
707
1,267
1
10 Mar 2022
Geodesic Multi-Modal Mixup for Robust Fine-Tuning
Neural Information Processing Systems (NeurIPS), 2022
Changdae Oh
Junhyuk So
Hoyoon Byun
Yongtaek Lim
Minchul Shin
Jong-June Jeon
Kyungwoo Song
424
38
0
08 Mar 2022
Problem-dependent attention and effort in neural networks with applications to image resolution and model selection
Image and Vision Computing (IVC), 2022
Chris Rohlfs
495
5
0
05 Jan 2022
Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets
Marcella Cornia
Lorenzo Baraldi
G. Fiameni
Rita Cucchiara
281
14
0
24 Nov 2021
XnODR and XnIDR: Two Accurate and Fast Fully Connected Layers For Convolutional Neural Networks
Journal of Intelligent and Robotic Systems (JIRS), 2021
Jian Sun
A. P. Fard
Mohammad H. Mahoor
3DPC
229
8
0
21 Nov 2021
The Computational Limits of Deep Learning
Neil C. Thompson
Kristjan Greenewald
Keeheon Lee
Gabriel F. Manso
VLM
281
626
0
10 Jul 2020