2205.01917
CoCa: Contrastive Captioners are Image-Text Foundation Models
4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"
50 / 1,042 papers shown
Title
All in Tokens: Unifying Output Space of Visual Tasks via Soft Token
IEEE International Conference on Computer Vision (ICCV), 2023
Jia Ning
Chen Li
Zheng Zhang
Zigang Geng
Jingdong Sun
Kun He
Han Hu
298
59
0
05 Jan 2023
Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Yue Han
Jiangning Zhang
Zhucun Xue
Chao Xu
Xintian Shen
Yabiao Wang
Chengjie Wang
Yong Liu
Xiangtai Li
333
22
0
03 Jan 2023
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2022
Wenhao Wu
Xiaohan Wang
Haipeng Luo
Jingdong Wang
Yi Yang
Wanli Ouyang
342
79
0
31 Dec 2022
FlatENN: Train Flat for Enhanced Fault Tolerance of Quantized Deep Neural Networks
Akul Malhotra
S. Gupta
83
0
0
29 Dec 2022
RevealED: Uncovering Pro-Eating Disorder Content on Twitter Using Deep Learning
J. Feldman
181
0
0
28 Dec 2022
Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning
IEEE International Conference on Computer Vision (ICCV), 2022
Woohyun Kang
Jonghwan Mun
Sungjun Lee
Byungseok Roh
VLM
233
27
0
27 Dec 2022
Do DALL-E and Flamingo Understand Each Other?
IEEE International Conference on Computer Vision (ICCV), 2022
Hang Li
Jindong Gu
Rajat Koner
Sahand Sharifzadeh
Volker Tresp
MLLM
218
14
0
23 Dec 2022
Infrared Image Super-Resolution: Systematic Review, and Future Trends
Y. Huang
Tomo Miyazaki
Xiao-Fang Liu
S. Omachi
SupR
606
17
0
22 Dec 2022
Generalized Decoding for Pixel, Image, and Language
Computer Vision and Pattern Recognition (CVPR), 2022
Xueyan Zou
Zi-Yi Dou
Jianwei Yang
Zhe Gan
Linjie Li
...
Lu Yuan
Nanyun Peng
Lijuan Wang
Yong Jae Lee
Jianfeng Gao
VLM
MLLM
ObjD
280
324
0
21 Dec 2022
ALCAP: Alignment-Augmented Music Captioner
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Zihao He
Weituo Hao
Weiyi Lu
Changyou Chen
Kristina Lerman
Xuchen Song
197
1
0
21 Dec 2022
Masked Event Modeling: Self-Supervised Pretraining for Event Cameras
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Simone Klenk
David Bonello
Lukas Koestler
Nikita Araslanov
Zorah Lähner
256
35
0
20 Dec 2022
Position-guided Text Prompt for Vision-Language Pre-training
Computer Vision and Pattern Recognition (CVPR), 2022
Alex Jinpeng Wang
Pan Zhou
Mike Zheng Shou
Shuicheng Yan
VLM
162
46
0
19 Dec 2022
Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization
Computer Vision and Pattern Recognition (CVPR), 2022
Chen Ju
Kunhao Zheng
Jinxian Liu
Peisen Zhao
Ya Zhang
Jianlong Chang
Yanfeng Wang
Qi Tian
176
17
0
19 Dec 2022
CLIPPO: Image-and-Language Understanding from Pixels Only
Computer Vision and Pattern Recognition (CVPR), 2022
Michael Tschannen
Basil Mustafa
N. Houlsby
CLIP
VLM
315
71
0
15 Dec 2022
Reproducible scaling laws for contrastive language-image learning
Computer Vision and Pattern Recognition (CVPR), 2022
Mehdi Cherti
Romain Beaumont
Ross Wightman
Mitchell Wortsman
Gabriel Ilharco
Cade Gordon
Christoph Schuhmann
Ludwig Schmidt
J. Jitsev
VLM
CLIP
465
1,134
0
14 Dec 2022
CREPE: Can Vision-Language Foundation Models Reason Compositionally?
Computer Vision and Pattern Recognition (CVPR), 2022
Zixian Ma
Jerry Hong
Mustafa Omer Gul
Mona Gandhi
Irena Gao
Ranjay Krishna
CoGe
363
179
0
13 Dec 2022
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory
Computer Vision and Pattern Recognition (CVPR), 2022
Ziniu Hu
Ahmet Iscen
Chen Sun
Zirui Wang
Kai-Wei Chang
Luke Huan
Cordelia Schmid
David A. Ross
Alireza Fathi
RALM
VLM
310
139
0
10 Dec 2022
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
International Conference on Learning Representations (ICLR), 2022
Aran Komatsuzaki
J. Puigcerver
James Lee-Thorp
Carlos Riquelme Ruiz
Basil Mustafa
Joshua Ainslie
Yi Tay
Mostafa Dehghani
N. Houlsby
MoMe
MoE
218
164
0
09 Dec 2022
VindLU: A Recipe for Effective Video-and-Language Pretraining
Computer Vision and Pattern Recognition (CVPR), 2022
Feng Cheng
Xizi Wang
Jie Lei
David J. Crandall
Joey Tianyi Zhou
Gedas Bertasius
VLM
268
91
0
09 Dec 2022
Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning
Computer Vision and Pattern Recognition (CVPR), 2022
Jishnu Mukhoti
Tsung-Yu Lin
Omid Poursaeed
Rui Wang
Ashish Shah
Juil Sock
Ser-Nam Lim
VLM
235
116
0
09 Dec 2022
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
Shen Yan
Tao Zhu
Zirui Wang
Yuan Cao
Mi Zhang
Soham Ghosh
Yonghui Wu
Jiahui Yu
VLM
VGen
308
69
0
09 Dec 2022
Learning Video Representations from Large Language Models
Computer Vision and Pattern Recognition (CVPR), 2022
Yue Zhao
Ishan Misra
Philipp Krahenbuhl
Rohit Girdhar
VLM
AI4TS
286
226
0
08 Dec 2022
Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval
Computer Vision and Image Understanding (CVIU), 2022
Mustafa Shukor
Nicolas Thome
Matthieu Cord
CLIP
CoGe
258
15
0
08 Dec 2022
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
Computer Vision and Pattern Recognition (CVPR), 2022
A. Piergiovanni
Weicheng Kuo
A. Angelova
ViT
224
68
0
06 Dec 2022
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
449
440
0
06 Dec 2022
Location-Aware Self-Supervised Transformers for Semantic Segmentation
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Mathilde Caron
N. Houlsby
Cordelia Schmid
ViT
305
23
0
05 Dec 2022
Compound Tokens: Channel Fusion for Vision-Language Representation Learning
Maxwell Mbabilla Aladago
A. Piergiovanni
199
2
0
02 Dec 2022
Scaling Language-Image Pre-training via Masking
Computer Vision and Pattern Recognition (CVPR), 2022
Yanghao Li
Haoqi Fan
Ronghang Hu
Christoph Feichtenhofer
Kaiming He
CLIP
VLM
370
388
0
01 Dec 2022
GRiT: A Generative Region-to-text Transformer for Object Understanding
European Conference on Computer Vision (ECCV), 2022
Jialian Wu
Jianfeng Wang
Zhengyuan Yang
Zhe Gan
Zicheng Liu
Junsong Yuan
Lijuan Wang
ObjD
VLM
245
145
0
01 Dec 2022
Exploiting Category Names for Few-Shot Classification with Vision-Language Models
Taihong Xiao
Zirui Wang
Liangliang Cao
Jiahui Yu
Shengyang Dai
Ming-Hsuan Yang
VLM
MLLM
239
5
0
29 Nov 2022
Context-Aware Robust Fine-Tuning
International Journal of Computer Vision (IJCV), 2022
Xiaofeng Mao
YueFeng Chen
Yang Liu
Rong Zhang
Hui Xue
Zhao Li
VLM
CLIP
174
38
0
29 Nov 2022
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
IEEE International Conference on Computer Vision (ICCV), 2022
Vishaal Udandarao
Ankush Gupta
Samuel Albanie
VLM
MLLM
455
142
0
28 Nov 2022
SLAN: Self-Locator Aided Network for Cross-Modal Understanding
Jiang-Tian Zhai
Tao Gui
Tong Wu
Xinghan Chen
Jiangjiang Liu
Bo Ren
Ming-Ming Cheng
ObjD
VLM
119
1
0
28 Nov 2022
Learning Object-Language Alignments for Open-Vocabulary Object Detection
International Conference on Learning Representations (ICLR), 2022
Chuang Lin
Pei Sun
Yi Jiang
Ping Luo
Zhuang Li
Gholamreza Haffari
Zehuan Yuan
Jianfei Cai
VLM
ObjD
173
116
0
27 Nov 2022
Exploring Consistency in Cross-Domain Transformer for Domain Adaptive Semantic Segmentation
Kaihong Wang
Donghyun Kim
Rogerio Feris
Kate Saenko
Margrit Betke
ViT
296
5
0
27 Nov 2022
Receptive Field Refinement for Convolutional Neural Networks Reliably Improves Predictive Performance
Mats L. Richter
C. Pal
154
5
0
26 Nov 2022
Differentially Private Image Classification from Features
Harsh Mehta
Walid Krichene
Abhradeep Thakurta
Alexey Kurakin
Ashok Cutkosky
227
10
0
24 Nov 2022
Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors
R. Burgert
Kanchana Ranasinghe
Xiang Li
Michael S. Ryoo
DiffM
VLM
270
41
0
23 Nov 2022
Mutual Information Learned Regressor: an Information-theoretic Viewpoint of Training Regression Systems
Xiaodong Wu
Q. Zhang
Zhengbo Chen
Qiaoan Liu
Weizhuo Shao
Yusen He
Yao Wang
SSL
139
0
0
23 Nov 2022
X²-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Yan Zeng
Xinsong Zhang
Hang Li
Jiawei Wang
Jipeng Zhang
Wangchunshu Zhou
VLM
MLLM
235
26
0
22 Nov 2022
Multitask Vision-Language Prompt Tuning
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Sheng Shen
Shijia Yang
Tianjun Zhang
Bohan Zhai
Joseph E. Gonzalez
Kurt Keutzer
Trevor Darrell
VLM
VPVLM
272
75
0
21 Nov 2022
Exploring Discrete Diffusion Models for Image Captioning
Zixin Zhu
Yixuan Wei
Jianfeng Wang
Zhe Gan
Zheng Zhang
Le Wang
G. Hua
Lijuan Wang
Zicheng Liu
Han Hu
DiffM
VLM
255
31
0
21 Nov 2022
Neural Dependencies Emerging from Learning Massive Categories
Computer Vision and Pattern Recognition (CVPR), 2022
Ruili Feng
Kecheng Zheng
Kai Zhu
Yujun Shen
Jian Zhao
Yukun Huang
Deli Zhao
Jingren Zhou
Michael I. Jordan
Zhengjun Zha
UQCV
103
0
0
21 Nov 2022
Unifying Vision-Language Representation Space with Single-tower Transformer
AAAI Conference on Artificial Intelligence (AAAI), 2022
Jiho Jang
Chaerin Kong
D. Jeon
Seonhoon Kim
Nojun Kwak
213
25
0
21 Nov 2022
You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model
Computer Vision and Pattern Recognition (CVPR), 2022
Sheng Tang
Yaqing Wang
Zhenglun Kong
Tianchi Zhang
Yao Li
Caiwen Ding
Yanzhi Wang
Yi Liang
Dongkuan Xu
176
49
0
21 Nov 2022
Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model
Nature Communications (Nat Commun), 2022
Jinho Chang
Jong Chul Ye
AI4CE
173
60
0
19 Nov 2022
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
Computer Vision and Pattern Recognition (CVPR), 2022
Hao Li
Jinguo Zhu
Xiaohu Jiang
Xizhou Zhu
Jiaming Song
...
Xiaohua Wang
Yu Qiao
Xiaogang Wang
Wenhai Wang
Jifeng Dai
MLLM
162
66
0
17 Nov 2022
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
Computer Vision and Pattern Recognition (CVPR), 2022
Weijie Su
Xizhou Zhu
Chenxin Tao
Lewei Lu
Bin Li
Gao Huang
Yu Qiao
Xiaogang Wang
Jie Zhou
Jifeng Dai
225
54
0
17 Nov 2022
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision
IEEE International Conference on Computer Vision (ICCV), 2022
Sophia Gu
Christopher Clark
Aniruddha Kembhavi
VLM
291
35
0
17 Nov 2022
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
Limin Wang
Yu Qiao
ViT
213
151
0
17 Nov 2022