CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM, CLIP, OffRL

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 1,042 papers shown
S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions
Neural Information Processing Systems (NeurIPS), 2023
Sangwoo Mo
Minkyu Kim
Kyungmin Lee
Jinwoo Shin
VLM, CLIP
350
38
0
23 May 2023
CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model
IEEE Transactions on Image Processing (IEEE TIP), 2023
Shuai Zhao
Xiaohan Wang
Linchao Zhu
Yezhou Yang
CLIP, VLM
369
45
0
23 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
IEEE Transactions on Multimedia (IEEE TMM), 2023
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
Qingbin Liu
Jiashi Feng
VLM, CLIP
293
23
0
22 May 2023
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design
Neural Information Processing Systems (NeurIPS), 2023
Ibrahim Alabdulmohsin
Xiaohua Zhai
Alexander Kolesnikov
Lucas Beyer
VLM
589
90
0
22 May 2023
Album Storytelling with Iterative Story-aware Captioning and Large Language Models
Munan Ning
Yujia Xie
Dongdong Chen
Zeyin Song
Lu Yuan
Yonghong Tian
QiXiang Ye
Liuliang Yuan
196
9
0
22 May 2023
Gloss-Free End-to-End Sign Language Translation
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Kezhou Lin
Xiaohan Wang
Linchao Zhu
Ke Sun
Bang Zhang
Yezhou Yang
SLR
232
36
0
22 May 2023
Towards Explainable In-the-Wild Video Quality Assessment: A Database and a Language-Prompted Approach
ACM Multimedia (ACM MM), 2023
Haoning Wu
Erli Zhang
Liang Liao
Chaofeng Chen
Jingwen Hou
Annan Wang
Wenxiu Sun
Qiong Yan
Weisi Lin
193
58
0
22 May 2023
i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data
Ziyi Yang
Mahmoud Khademi
Yichong Xu
Reid Pryzant
Yuwei Fang
...
Yu Shi
Lu Yuan
Takuya Yoshioka
Michael Zeng
Xuedong Huang
154
4
0
21 May 2023
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
ACM Multimedia (ACM MM), 2023
Zikang Liu
Sihan Chen
Longteng Guo
Handong Li
Xingjian He
Qingbin Liu
202
3
0
19 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM, MLLM, ObjD
588
154
0
18 May 2023
MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts
Asian Conference on Computer Vision (ACCV), 2023
Qiuhui Chen
Xinyue Hu
Zirui Wang
Yi Hong
LM&MA, MedIm
174
67
0
18 May 2023
What You See is What You Read? Improving Text-Image Alignment Evaluation
Neural Information Processing Systems (NeurIPS), 2023
Michal Yarom
Yonatan Bitton
Soravit Changpinyo
Roee Aharoni
Jonathan Herzig
Oran Lang
E. Ofek
Idan Szpektor
EGVM
568
116
0
17 May 2023
Improved baselines for vision-language pre-training
Enrico Fini
Pietro Astolfi
Adriana Romero Soriano
Jakob Verbeek
M. Drozdzal
SSL, CLIP, VLM
387
26
0
15 May 2023
OneCAD: One Classifier for All image Datasets using multimodal learning
S. Wadekar
Eugenio Culurciello
282
0
0
11 May 2023
Simple Token-Level Confidence Improves Caption Correctness
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Suzanne Petryk
Spencer Whitehead
Joseph E. Gonzalez
Trevor Darrell
Anna Rohrbach
Marcus Rohrbach
244
10
0
11 May 2023
An Inverse Scaling Law for CLIP Training
Neural Information Processing Systems (NeurIPS), 2023
Xianhang Li
Zeyu Wang
Cihang Xie
VLM, CLIP
311
77
0
11 May 2023
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers
Computer Vision and Pattern Recognition (CVPR), 2023
Dahun Kim
A. Angelova
Weicheng Kuo
ObjD, ViT, VLM
416
110
0
11 May 2023
Self-Chained Image-Language Model for Video Localization and Question Answering
Neural Information Processing Systems (NeurIPS), 2023
Shoubin Yu
Jaemin Cho
Prateek Yadav
Joey Tianyi Zhou
397
200
0
11 May 2023
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
Neural Information Processing Systems (NeurIPS), 2023
Hassan Akbari
Dan Kondratyuk
Huayu Chen
Rachel Hornung
Jian Shu
Hartwig Adam
VLM, MoE
291
23
0
10 May 2023
Visual Tuning
ACM Computing Surveys (ACM Comput. Surv.), 2023
Bruce X. B. Yu
Jianlong Chang
Haixin Wang
Lin Liu
Shijie Wang
...
Lingxi Xie
Haojie Li
Zhouchen Lin
Qi Tian
Chang Wen Chen
VLM
438
60
0
10 May 2023
ImageBind: One Embedding Space To Bind Them All
Computer Vision and Pattern Recognition (CVPR), 2023
Rohit Girdhar
Alaaeldin El-Nouby
Zhuang Liu
Mannat Singh
Kalyan Vasudev Alwala
Armand Joulin
Ishan Misra
VLM
553
1,305
0
09 May 2023
Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness
Liangliang Cao
Bowen Zhang
Chen Chen
Yinfei Yang
Xianzhi Du
Wen‐Cheng Zhang
Zhiyun Lu
Yantao Zheng
CLIP, VLM
183
17
0
08 May 2023
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
AAAI Conference on Artificial Intelligence (AAAI), 2023
Yufen Huang
Jiji Tang
Zhuo Chen
Rongsheng Zhang
Xinfeng Zhang
...
Zeng Zhao
Zhou Zhao
Tangjie Lv
Zhipeng Hu
Wen Zhang
VLM
308
49
0
06 May 2023
TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis
IEEE International Conference on Computer Vision (ICCV), 2023
Mathis Petrovich
Michael J. Black
Gül Varol
VGen
346
154
0
02 May 2023
Multimodal Neural Databases
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2023
Giovanni Trappolini
Andrea Santilli
Emanuele Rodolà
A. Halevy
Fabrizio Silvestri
246
11
0
02 May 2023
What Do Self-Supervised Vision Transformers Learn?
International Conference on Learning Representations (ICLR), 2023
Namuk Park
Wonjae Kim
Byeongho Heo
Taekyung Kim
Sangdoo Yun
SSL
300
103
1
01 May 2023
Adversarial Representation Learning for Robust Privacy Preservation in Audio
IEEE Open Journal of Signal Processing (IEEE Open J. Signal Process.), 2023
Shayan Gharib
Minh Tran
Diep Luong
Konstantinos Drossos
Maria Sandsten
AAML
218
7
0
29 Apr 2023
An Empirical Study of Multimodal Model Merging
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yi-Lin Sung
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Joey Tianyi Zhou
Lijuan Wang
MoMe
335
52
0
28 Apr 2023
Towards Robust Text-Prompted Semantic Criterion for In-the-Wild Video Quality Assessment
Haoning Wu
Liang Liao
Annan Wang
Chaofeng Chen
Jingwen Hou
Wenxiu Sun
Qiong Yan
Weisi Lin
215
15
0
28 Apr 2023
Retrieval-based Knowledge Augmented Vision Language Pre-training
ACM Multimedia (ACM MM), 2023
Jiahua Rao
Zifei Shan
Long Liu
Yao Zhou
Yuedong Yang
VLM
299
22
0
27 Apr 2023
RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models
Seulki Park
Daeho Um
Hajung Yoon
Sanghyuk Chun
Sangdoo Yun
Hawook Jeong
398
5
0
21 Apr 2023
DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training
Computer Vision and Pattern Recognition (CVPR), 2023
Yihao Chen
Xianbiao Qi
Jianan Wang
Lei Zhang
175
24
0
17 Apr 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
Jing Liu
VLM
396
152
0
17 Apr 2023
Permutation Equivariance of Transformers and Its Applications
Computer Vision and Pattern Recognition (CVPR), 2023
Hengyuan Xu
Liyao Xiang
Hang Ye
Dixi Yao
Pengzhi Chu
Baochun Li
326
25
0
16 Apr 2023
MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation
Chinese Conference on Pattern Recognition and Computer Vision (CPRCV), 2023
Jie Guo
Qimeng Wang
Yan Gao
Xiaolong Jiang
Xu Tang
Yao Hu
Baochang Zhang
VLM
166
14
0
14 Apr 2023
Efficient Multimodal Fusion via Interactive Prompting
Computer Vision and Pattern Recognition (CVPR), 2023
Yaowei Li
Ruijie Quan
Linchao Zhu
Yezhou Yang
159
62
0
13 Apr 2023
RECLIP: Resource-efficient CLIP by Training with Small Images
Runze Li
Dahun Kim
B. Bhanu
Weicheng Kuo
VLM, CLIP
264
17
0
12 Apr 2023
Gradient-Free Textual Inversion
ACM Multimedia (ACM MM), 2023
Zhengcong Fei
Mingyuan Fan
Junshi Huang
DiffM
260
38
0
12 Apr 2023
MoMo: A shared encoder Model for text, image and multi-Modal representations
Rakesh Chada
Zhao-Heng Zheng
P. Natarajan
ViT
116
5
0
11 Apr 2023
Improving Image Recognition by Retrieving from Web-Scale Image-Text Data
Computer Vision and Pattern Recognition (CVPR), 2023
Ahmet Iscen
Alireza Fathi
Cordelia Schmid
VLM, 3DV
265
30
0
11 Apr 2023
Token Boosting for Robust Self-Supervised Visual Transformer Pre-training
Computer Vision and Pattern Recognition (CVPR), 2023
Tianjiao Li
Lin Geng Foo
Ping Hu
Xindi Shang
Hossein Rahmani
Zehuan Yuan
Jing Liu
300
7
0
09 Apr 2023
Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-commerce
Computer Vision and Pattern Recognition (CVPR), 2023
Yang Jin
Yongzhi Li
Zehuan Yuan
Yadong Mu
168
20
0
06 Apr 2023
VicTR: Video-conditioned Text Representations for Activity Recognition
Computer Vision and Pattern Recognition (CVPR), 2023
Kumara Kahatapitiya
Anurag Arnab
Arsha Nagrani
Michael S. Ryoo
347
36
0
05 Apr 2023
Adopting Two Supervisors for Efficient Use of Large-Scale Remote Deep Neural Networks
ACM Transactions on Software Engineering and Methodology (TOSEM), 2023
Michael Weiss
Paolo Tonella
AI4CE
191
1
0
05 Apr 2023
ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules
IEEE International Conference on Computer Vision (ICCV), 2023
Zhi-Qi Cheng
Qianwen Dai
Siyao Li
Yuxuan Zhou
Teruko Mitamura
Alexander G. Hauptmann
218
28
0
05 Apr 2023
Uncertainty estimation in Deep Learning for Panoptic segmentation
Michael J. Smith
F. Ferrie
OOD, UQCV
178
0
0
04 Apr 2023
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data
Vladislav Lialin
Stephen Rawls
David M. Chan
Shalini Ghosh
Anna Rumshisky
Wael Hamza
VLM, AI4TS
267
8
0
04 Apr 2023
Black Box Few-Shot Adaptation for Vision-Language models
IEEE International Conference on Computer Vision (ICCV), 2023
Yassine Ouali
Adrian Bulat
Brais Martínez
Georgios Tzimiropoulos
VLM
247
45
0
04 Apr 2023
Exploring Vision-Language Models for Imbalanced Learning
International Journal of Computer Vision (IJCV), 2023
Yidong Wang
Zhuohao Yu
Yongfeng Zhang
Qiang Heng
Haoxing Chen
Wei Ye
Rui Xie
Xingxu Xie
Shi-Bo Zhang
VLM
308
52
0
04 Apr 2023
Vision-Language Models for Vision Tasks: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Jingyi Zhang
Jiaxing Huang
Sheng Jin
Shijian Lu
VLM
499
1,014
0
03 Apr 2023