CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
    VLM, CLIP, OffRL

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 1,042 papers shown
All in Tokens: Unifying Output Space of Visual Tasks via Soft Token
IEEE International Conference on Computer Vision (ICCV), 2023
Jia Ning
Chen Li
Zheng Zhang
Zigang Geng
Jingdong Sun
Kun He
Han Hu
298
59
0
05 Jan 2023
Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Yue Han
Jiangning Zhang
Zhucun Xue
Chao Xu
Xintian Shen
Yabiao Wang
Chengjie Wang
Yong Liu
Xiangtai Li
333
22
0
03 Jan 2023
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2022
Wenhao Wu
Xiaohan Wang
Haipeng Luo
Jingdong Wang
Yi Yang
Wanli Ouyang
342
79
0
31 Dec 2022
FlatENN: Train Flat for Enhanced Fault Tolerance of Quantized Deep Neural Networks
Akul Malhotra
S. Gupta
83
0
0
29 Dec 2022
RevealED: Uncovering Pro-Eating Disorder Content on Twitter Using Deep Learning
J. Feldman
181
0
0
28 Dec 2022
Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning
IEEE International Conference on Computer Vision (ICCV), 2022
Woohyun Kang
Jonghwan Mun
Sungjun Lee
Byungseok Roh
VLM
233
27
0
27 Dec 2022
Do DALL-E and Flamingo Understand Each Other?
IEEE International Conference on Computer Vision (ICCV), 2022
Hang Li
Jindong Gu
Rajat Koner
Sahand Sharifzadeh
Volker Tresp
MLLM
218
14
0
23 Dec 2022
Infrared Image Super-Resolution: Systematic Review, and Future Trends
Y. Huang
Tomo Miyazaki
Xiao-Fang Liu
S. Omachi
SupR
606
17
0
22 Dec 2022
Generalized Decoding for Pixel, Image, and Language
Computer Vision and Pattern Recognition (CVPR), 2022
Xueyan Zou
Zi-Yi Dou
Jianwei Yang
Zhe Gan
Linjie Li
...
Lu Yuan
Nanyun Peng
Lijuan Wang
Yong Jae Lee
Jianfeng Gao
VLM, MLLM, ObjD
280
324
0
21 Dec 2022
ALCAP: Alignment-Augmented Music Captioner
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Zihao He
Weituo Hao
Weiyi Lu
Changyou Chen
Kristina Lerman
Xuchen Song
197
1
0
21 Dec 2022
Masked Event Modeling: Self-Supervised Pretraining for Event Cameras
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Simone Klenk
David Bonello
Lukas Koestler
Nikita Araslanov
Zorah Lähner
256
35
0
20 Dec 2022
Position-guided Text Prompt for Vision-Language Pre-training
Computer Vision and Pattern Recognition (CVPR), 2022
Alex Jinpeng Wang
Pan Zhou
Mike Zheng Shou
Shuicheng Yan
VLM
162
46
0
19 Dec 2022
Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization
Computer Vision and Pattern Recognition (CVPR), 2022
Chen Ju
Kunhao Zheng
Jinxian Liu
Peisen Zhao
Ya Zhang
Jianlong Chang
Yanfeng Wang
Qi Tian
176
17
0
19 Dec 2022
CLIPPO: Image-and-Language Understanding from Pixels Only
Computer Vision and Pattern Recognition (CVPR), 2022
Michael Tschannen
Basil Mustafa
N. Houlsby
CLIP, VLM
315
71
0
15 Dec 2022
Reproducible scaling laws for contrastive language-image learning
Computer Vision and Pattern Recognition (CVPR), 2022
Mehdi Cherti
Romain Beaumont
Ross Wightman
Mitchell Wortsman
Gabriel Ilharco
Cade Gordon
Christoph Schuhmann
Ludwig Schmidt
J. Jitsev
VLM, CLIP
465
1,134
0
14 Dec 2022
CREPE: Can Vision-Language Foundation Models Reason Compositionally?
Computer Vision and Pattern Recognition (CVPR), 2022
Zixian Ma
Jerry Hong
Mustafa Omer Gul
Mona Gandhi
Irena Gao
Ranjay Krishna
CoGe
363
179
0
13 Dec 2022
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory
Computer Vision and Pattern Recognition (CVPR), 2022
Ziniu Hu
Ahmet Iscen
Chen Sun
Zirui Wang
Kai-Wei Chang
Luke Huan
Cordelia Schmid
David A. Ross
Alireza Fathi
RALM, VLM
310
139
0
10 Dec 2022
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
International Conference on Learning Representations (ICLR), 2022
Aran Komatsuzaki
J. Puigcerver
James Lee-Thorp
Carlos Riquelme Ruiz
Basil Mustafa
Joshua Ainslie
Yi Tay
Mostafa Dehghani
N. Houlsby
MoMe, MoE
218
164
0
09 Dec 2022
VindLU: A Recipe for Effective Video-and-Language Pretraining
Computer Vision and Pattern Recognition (CVPR), 2022
Feng Cheng
Xizi Wang
Jie Lei
David J. Crandall
Joey Tianyi Zhou
Gedas Bertasius
VLM
268
91
0
09 Dec 2022
Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning
Computer Vision and Pattern Recognition (CVPR), 2022
Jishnu Mukhoti
Tsung-Yu Lin
Omid Poursaeed
Rui Wang
Ashish Shah
Juil Sock
Ser-Nam Lim
VLM
235
116
0
09 Dec 2022
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
Shen Yan
Tao Zhu
Zirui Wang
Yuan Cao
Mi Zhang
Soham Ghosh
Yonghui Wu
Jiahui Yu
VLM, VGen
308
69
0
09 Dec 2022
Learning Video Representations from Large Language Models
Computer Vision and Pattern Recognition (CVPR), 2022
Yue Zhao
Ishan Misra
Philipp Krahenbuhl
Rohit Girdhar
VLM, AI4TS
286
226
0
08 Dec 2022
Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval
Computer Vision and Image Understanding (CVIU), 2022
Mustafa Shukor
Nicolas Thome
Matthieu Cord
CLIP, CoGe
258
15
0
08 Dec 2022
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
Computer Vision and Pattern Recognition (CVPR), 2022
A. Piergiovanni
Weicheng Kuo
A. Angelova
ViT
224
68
0
06 Dec 2022
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLM, VGen
449
440
0
06 Dec 2022
Location-Aware Self-Supervised Transformers for Semantic Segmentation
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Mathilde Caron
N. Houlsby
Cordelia Schmid
ViT
305
23
0
05 Dec 2022
Compound Tokens: Channel Fusion for Vision-Language Representation Learning
Maxwell Mbabilla Aladago
A. Piergiovanni
199
2
0
02 Dec 2022
Scaling Language-Image Pre-training via Masking
Computer Vision and Pattern Recognition (CVPR), 2022
Yanghao Li
Haoqi Fan
Ronghang Hu
Christoph Feichtenhofer
Kaiming He
CLIP, VLM
370
388
0
01 Dec 2022
GRiT: A Generative Region-to-text Transformer for Object Understanding
European Conference on Computer Vision (ECCV), 2022
Jialian Wu
Jianfeng Wang
Zhengyuan Yang
Zhe Gan
Zicheng Liu
Junsong Yuan
Lijuan Wang
ObjD, VLM
245
145
0
01 Dec 2022
Exploiting Category Names for Few-Shot Classification with Vision-Language Models
Taihong Xiao
Zirui Wang
Liangliang Cao
Jiahui Yu
Shengyang Dai
Ming-Hsuan Yang
VLM, MLLM
239
5
0
29 Nov 2022
Context-Aware Robust Fine-Tuning
International Journal of Computer Vision (IJCV), 2022
Xiaofeng Mao
YueFeng Chen
Yang Liu
Rong Zhang
Hui Xue
Zhao Li
VLM, CLIP
174
38
0
29 Nov 2022
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
IEEE International Conference on Computer Vision (ICCV), 2022
Vishaal Udandarao
Ankush Gupta
Samuel Albanie
VLM, MLLM
455
142
0
28 Nov 2022
SLAN: Self-Locator Aided Network for Cross-Modal Understanding
Jiang-Tian Zhai
Tao Gui
Tong Wu
Xinghan Chen
Jiangjiang Liu
Bo Ren
Ming-Ming Cheng
ObjD, VLM
119
1
0
28 Nov 2022
Learning Object-Language Alignments for Open-Vocabulary Object Detection
International Conference on Learning Representations (ICLR), 2022
Chuang Lin
Pei Sun
Yi Jiang
Ping Luo
Zhuang Li
Gholamreza Haffari
Zehuan Yuan
Jianfei Cai
VLM, ObjD
173
116
0
27 Nov 2022
Exploring Consistency in Cross-Domain Transformer for Domain Adaptive Semantic Segmentation
Kaihong Wang
Donghyun Kim
Rogerio Feris
Kate Saenko
Margrit Betke
ViT
296
5
0
27 Nov 2022
Receptive Field Refinement for Convolutional Neural Networks Reliably Improves Predictive Performance
Mats L. Richter
C. Pal
154
5
0
26 Nov 2022
Differentially Private Image Classification from Features
Harsh Mehta
Walid Krichene
Abhradeep Thakurta
Alexey Kurakin
Ashok Cutkosky
227
10
0
24 Nov 2022
Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors
R. Burgert
Kanchana Ranasinghe
Xiang Li
Michael S. Ryoo
DiffM, VLM
270
41
0
23 Nov 2022
Mutual Information Learned Regressor: an Information-theoretic Viewpoint of Training Regression Systems
Xiaodong Wu
Q. Zhang
Zhengbo Chen
Qiaoan Liu
Weizhuo Shao
Yusen He
Yao Wang
SSL
139
0
0
23 Nov 2022
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Yan Zeng
Xinsong Zhang
Hang Li
Jiawei Wang
Jipeng Zhang
Wangchunshu Zhou
VLM, MLLM
235
26
0
22 Nov 2022
Multitask Vision-Language Prompt Tuning
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Sheng Shen
Shijia Yang
Tianjun Zhang
Bohan Zhai
Joseph E. Gonzalez
Kurt Keutzer
Trevor Darrell
VLM, VPVLM
272
75
0
21 Nov 2022
Exploring Discrete Diffusion Models for Image Captioning
Zixin Zhu
Yixuan Wei
Jianfeng Wang
Zhe Gan
Zheng Zhang
Le Wang
G. Hua
Lijuan Wang
Zicheng Liu
Han Hu
DiffM, VLM
255
31
0
21 Nov 2022
Neural Dependencies Emerging from Learning Massive Categories
Computer Vision and Pattern Recognition (CVPR), 2022
Ruili Feng
Kecheng Zheng
Kai Zhu
Yujun Shen
Jian Zhao
Yukun Huang
Deli Zhao
Jingren Zhou
Michael I. Jordan
Zhengjun Zha
UQCV
103
0
0
21 Nov 2022
Unifying Vision-Language Representation Space with Single-tower Transformer
AAAI Conference on Artificial Intelligence (AAAI), 2022
Jiho Jang
Chaerin Kong
D. Jeon
Seonhoon Kim
Nojun Kwak
213
25
0
21 Nov 2022
You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model
Computer Vision and Pattern Recognition (CVPR), 2022
Sheng Tang
Yaqing Wang
Zhenglun Kong
Tianchi Zhang
Yao Li
Caiwen Ding
Yanzhi Wang
Yi Liang
Dongkuan Xu
176
49
0
21 Nov 2022
Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model
Nature Communications (Nat Commun), 2022
Jinho Chang
Jong Chul Ye
AI4CE
173
60
0
19 Nov 2022
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
Computer Vision and Pattern Recognition (CVPR), 2022
Hao Li
Jinguo Zhu
Xiaohu Jiang
Xizhou Zhu
Jiaming Song
...
Xiaohua Wang
Yu Qiao
Xiaogang Wang
Wenhai Wang
Jifeng Dai
MLLM
162
66
0
17 Nov 2022
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
Computer Vision and Pattern Recognition (CVPR), 2022
Weijie Su
Xizhou Zhu
Chenxin Tao
Lewei Lu
Bin Li
Gao Huang
Yu Qiao
Xiaogang Wang
Jie Zhou
Jifeng Dai
225
54
0
17 Nov 2022
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision
IEEE International Conference on Computer Vision (ICCV), 2022
Sophia Gu
Christopher Clark
Aniruddha Kembhavi
VLM
291
35
0
17 Nov 2022
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
Limin Wang
Yu Qiao
ViT
213
151
0
17 Nov 2022