CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM, CLIP, OffRL
arXiv:2205.01917 (abs) · PDF · HTML · HuggingFace (3 upvotes)

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

41 / 1,041 papers shown
Patching open-vocabulary models by interpolating weights
Neural Information Processing Systems (NeurIPS), 2022
Gabriel Ilharco
Mitchell Wortsman
S. Gadre
Shuran Song
Hannaneh Hajishirzi
Simon Kornblith
Ali Farhadi
Ludwig Schmidt
VLM, KELM
323
201
0
10 Aug 2022
Self-supervised Multi-modal Training from Uncurated Image and Reports Enables Zero-shot Oversight Artificial Intelligence in Radiology
Sangjoon Park
Eunha Lee
Kyung Sook Shin
Jeonghyeon Lee
Jong Chul Ye
141
2
0
10 Aug 2022
Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model
IEEE Transactions on Geoscience and Remote Sensing (IEEE TGRS), 2022
Di Wang
Qiming Zhang
Yufei Xu
Jing Zhang
Bo Du
Dacheng Tao
Guang Dai
261
316
0
08 Aug 2022
Prompt Tuning for Generative Multimodal Pretrained Models
Han Yang
Junyang Lin
An Yang
Peng Wang
Chang Zhou
Hongxia Yang
VLM, LRM, VPVLM
176
37
0
04 Aug 2022
Masked Vision and Language Modeling for Multi-modal Representation Learning
International Conference on Learning Representations (ICLR), 2022
Gukyeong Kwon
Zhaowei Cai
Avinash Ravichandran
Erhan Bas
Rahul Bhotika
Stefano Soatto
199
84
0
03 Aug 2022
Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models
Rui Qian
Yeqing Li
Zheng Xu
Ming-Hsuan Yang
Serge Belongie
Huayu Chen
VLM
160
25
0
15 Jul 2022
Convolutional Bypasses Are Better Vision Transformer Adapters
European Conference on Artificial Intelligence (ECAI), 2022
Shibo Jie
Zhi-Hong Deng
VPVLM
229
156
0
14 Jul 2022
Distance Learner: Incorporating Manifold Prior to Model Training
Aditya Chetan
Nipun Kwatra
65
1
0
14 Jul 2022
Revisiting Classifier: Transferring Vision-Language Models for Video Recognition
AAAI Conference on Artificial Intelligence (AAAI), 2022
Wenhao Wu
Zhun Sun
Wanli Ouyang
VLM
346
124
0
04 Jul 2022
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning
Neural Information Processing Systems (NeurIPS), 2022
Junting Pan
Ziyi Lin
Xiatian Zhu
Jing Shao
Jiaming Song
320
259
0
27 Jun 2022
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Jiahui Yu
Yuanzhong Xu
Jing Yu Koh
Thang Luong
Gunjan Baid
...
Zarana Parekh
Xin Li
Han Zhang
Jason Baldridge
Yonghui Wu
EGVM
561
1,349
0
22 Jun 2022
REVECA -- Rich Encoder-decoder framework for Video Event CAptioner
Jaehyuk Heo
YongGi Jeong
Sunwoo Kim
Jaehee Kim
Pilsung Kang
98
0
0
18 Jun 2022
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
International Conference on Learning Representations (ICLR), 2022
Jiasen Lu
Christopher Clark
Rowan Zellers
Roozbeh Mottaghi
Aniruddha Kembhavi
ObjD, VLM, MLLM
385
472
0
17 Jun 2022
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning
AAAI Conference on Artificial Intelligence (AAAI), 2022
Xiao Xu
Chenfei Wu
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
244
90
0
17 Jun 2022
MixGen: A New Multi-Modal Data Augmentation
Xiaoshuai Hao
Yi Zhu
Srikar Appalaraju
Aston Zhang
Wanqian Zhang
Boyang Li
Mu Li
VLM
361
121
0
16 Jun 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Neural Information Processing Systems (NeurIPS), 2022
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM, ObjD
226
150
0
15 Jun 2022
Multimodal Learning with Transformers: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Peng Xu
Xiatian Zhu
David Clifton
ViT
475
819
0
13 Jun 2022
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
Neural Information Processing Systems (NeurIPS), 2022
Jinguo Zhu
Xizhou Zhu
Wenhai Wang
Xiaohua Wang
Jiaming Song
Xiaogang Wang
Jifeng Dai
MoMe, MoE
261
84
0
09 Jun 2022
Neural Collapse: A Review on Modelling Principles and Generalization
Vignesh Kothapalli
367
103
0
08 Jun 2022
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Neural Information Processing Systems (NeurIPS), 2022
Basil Mustafa
C. Riquelme
J. Puigcerver
Rodolphe Jenatton
N. Houlsby
VLM, MoE
330
270
0
06 Jun 2022
Delving into the Openness of CLIP
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Shuhuai Ren
Lei Li
Xuancheng Ren
Guangxiang Zhao
Xu Sun
VLM
220
15
0
04 Jun 2022
Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning
Neural Information Processing Systems (NeurIPS), 2022
Yujia Xie
Luowei Zhou
Xiyang Dai
Lu Yuan
Nguyen Bach
Ce Liu
Michael Zeng
VLM, MLLM
156
30
0
03 Jun 2022
VL-BEiT: Generative Vision-Language Pretraining
Hangbo Bao
Wenhui Wang
Li Dong
Furu Wei
VLM
158
48
0
02 Jun 2022
Prefix Conditioning Unifies Language and Label Supervision
Computer Vision and Pattern Recognition (CVPR), 2022
Kuniaki Saito
Kihyuk Sohn
Xinming Zhang
Chun-Liang Li
Chen-Yu Lee
Kate Saenko
Tomas Pfister
VLM, CLIP
159
18
0
02 Jun 2022
Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Yan Zeng
Wangchunshu Zhou
Ao Luo
Ziming Cheng
Xinsong Zhang
VLM
281
37
0
01 Jun 2022
MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining
Pengyuan Lyu
Chengquan Zhang
Shanshan Liu
Meina Qiao
Yangliu Xu
Liang Wu
Kun Yao
Junyu Han
Errui Ding
Jingdong Wang
489
46
0
01 Jun 2022
Multimodal Masked Autoencoders Learn Transferable Representations
Xinyang Geng
Hao Liu
Lisa Lee
Dale Schuurmans
Sergey Levine
Pieter Abbeel
307
132
0
27 May 2022
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
582
698
0
27 May 2022
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Neural Information Processing Systems (NeurIPS), 2022
Chitwan Saharia
William Chan
Saurabh Saxena
Lala Li
Jay Whang
...
Raphael Gontijo-Lopes
Tim Salimans
Jonathan Ho
David J Fleet
Mohammad Norouzi
VLM
1.1K
7,395
0
23 May 2022
Deep transfer learning for image classification: a survey
J. Plested
Musa Phiri
Tom Gedeon
OOD
190
46
0
20 May 2022
Training Vision-Language Transformers from Captions
Liangke Gui
Yingshan Chang
Qiuyuan Huang
Subhojit Som
Alexander G. Hauptmann
Jianfeng Gao
Yonatan Bisk
VLM, ViT
366
11
0
19 May 2022
When does dough become a bagel? Analyzing the remaining mistakes on ImageNet
Neural Information Processing Systems (NeurIPS), 2022
Vijay Vasudevan
Benjamin Caine
Raphael Gontijo-Lopes
Sara Fridovich-Keil
Rebecca Roelofs
VLM, UQCV
181
69
0
09 May 2022
Unlocking High-Accuracy Differentially Private Image Classification through Scale
Soham De
Leonard Berrada
Jamie Hayes
Samuel L. Smith
Borja Balle
335
261
0
28 Apr 2022
CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks
International Conference on Learning Representations (ICLR), 2022
Tuomas P. Oikarinen
Tsui-Wei Weng
VLM
354
122
1
23 Apr 2022
Single-Stream Multi-Level Alignment for Vision-Language Pretraining
European Conference on Computer Vision (ECCV), 2022
Zaid Khan
B. Vijaykumar
Xiang Yu
S. Schulter
Manmohan Chandraker
Y. Fu
CLIP, VLM
272
21
0
27 Mar 2022
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
International Conference on Machine Learning (ICML), 2022
Mitchell Wortsman
Gabriel Ilharco
S. Gadre
Rebecca Roelofs
Raphael Gontijo-Lopes
...
Hongseok Namkoong
Ali Farhadi
Y. Carmon
Simon Kornblith
Ludwig Schmidt
MoMe
707
1,267
1
10 Mar 2022
Geodesic Multi-Modal Mixup for Robust Fine-Tuning
Neural Information Processing Systems (NeurIPS), 2022
Changdae Oh
Junhyuk So
Hoyoon Byun
Yongtaek Lim
Minchul Shin
Jong-June Jeon
Kyungwoo Song
424
38
0
08 Mar 2022
Problem-dependent attention and effort in neural networks with applications to image resolution and model selection
Image and Vision Computing (IVC), 2022
Chris Rohlfs
495
5
0
05 Jan 2022
Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets
Marcella Cornia
Lorenzo Baraldi
G. Fiameni
Rita Cucchiara
281
14
0
24 Nov 2021
XnODR and XnIDR: Two Accurate and Fast Fully Connected Layers For Convolutional Neural Networks
Journal of Intelligent and Robotic Systems (JIRS), 2021
Jian Sun
A. P. Fard
Mohammad H. Mahoor
3DPC
229
8
0
21 Nov 2021
The Computational Limits of Deep Learning
Neil C. Thompson
Kristjan Greenewald
Keeheon Lee
Gabriel F. Manso
VLM
281
626
0
10 Jul 2020