ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2205.01917
  4. Cited By
CoCa: Contrastive Captioners are Image-Text Foundation Models
v1v2 (latest)

CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
    VLMCLIPOffRL
ArXiv (abs)PDFHTMLHuggingFace (3 upvotes)

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 1,042 papers shown
SalFoM: Dynamic Saliency Prediction with Video Foundation Models
SalFoM: Dynamic Saliency Prediction with Video Foundation ModelsInternational Conference on Pattern Recognition (ICPR), 2024
Morteza Moradi
Mohammad Moradi
Francesco Rundo
C. Spampinato
Ali Borji
S. Palazzo
231
3
0
03 Apr 2024
Segment Any 3D Object with Language
Segment Any 3D Object with LanguageInternational Conference on Learning Representations (ICLR), 2024
Seungjun Lee
Yuyang Zhao
Gim Hee Lee
266
7
0
02 Apr 2024
Iterated Learning Improves Compositionality in Large Vision-Language
  Models
Iterated Learning Improves Compositionality in Large Vision-Language ModelsComputer Vision and Pattern Recognition (CVPR), 2024
Chenhao Zheng
Jieyu Zhang
Aniruddha Kembhavi
Ranjay Krishna
VLMCoGe
259
16
0
02 Apr 2024
ViTamin: Designing Scalable Vision Models in the Vision-Language Era
ViTamin: Designing Scalable Vision Models in the Vision-Language EraComputer Vision and Pattern Recognition (CVPR), 2024
Jienneg Chen
Qihang Yu
Xiaohui Shen
Yaoyao Liu
Liang-Chieh Chen
3DVVLM
415
50
0
02 Apr 2024
Fashion Style Editing with Generative Human Prior
Fashion Style Editing with Generative Human Prior
Chaerin Kong
Seungyong Lee
Soohyeok Im
Wonsuk Yang
305
0
0
02 Apr 2024
VLRM: Vision-Language Models act as Reward Models for Image Captioning
VLRM: Vision-Language Models act as Reward Models for Image Captioning
Maksim Dzabraev
Alexander Kunitsyn
Andrei Ivaniuta
VLMMLLM
186
16
0
02 Apr 2024
Streaming Dense Video Captioning
Streaming Dense Video Captioning
Xingyi Zhou
Anurag Arnab
Shyamal Buch
Shen Yan
Austin Myers
Xuehan Xiong
Arsha Nagrani
Cordelia Schmid
VLM
253
73
0
01 Apr 2024
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Agneet Chatterjee
Gabriela Ben-Melech Stan
Estelle Aflalo
Sayak Paul
Dhruba Ghosh
...
Ludwig Schmidt
Hanna Hajishirzi
Vasudev Lal
Chitta Baral
Yezhou Yang
EGVMVLM
306
25
0
01 Apr 2024
GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields
GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields
Yunsong Wang
Hanlin Chen
Gim Hee Lee
250
9
0
01 Apr 2024
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Kai Zhang
Yi Luan
Hexiang Hu
Kenton Lee
Siyuan Qiao
Wenhu Chen
Yu-Chuan Su
Ming-Wei Chang
VLMLRM
301
73
0
28 Mar 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
268
10
0
28 Mar 2024
LocCa: Visual Pretraining with Location-aware Captioners
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiao-Qi Zhai
VLM
376
21
0
28 Mar 2024
CLAP4CLIP: Continual Learning with Probabilistic Finetuning for
  Vision-Language Models
CLAP4CLIP: Continual Learning with Probabilistic Finetuning for Vision-Language Models
Saurav Jha
Dong Gong
Lina Yao
CLIPVLM
445
18
0
28 Mar 2024
Language Plays a Pivotal Role in the Object-Attribute Compositional
  Generalization of CLIP
Language Plays a Pivotal Role in the Object-Attribute Compositional Generalization of CLIP
Reza Abbasi
Mohammad Samiei
M. Rohban
M. Baghshah
VLMCoGe
229
0
0
27 Mar 2024
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering
  Using a VLM
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
Wonkyun Kim
Changin Choi
Wonseok Lee
Wonjong Rhee
VLM
229
81
0
27 Mar 2024
Residual-based Language Models are Free Boosters for Biomedical Imaging
Residual-based Language Models are Free Boosters for Biomedical Imaging
Zhixin Lai
Jing Wu
Suiyao Chen
Yucheng Zhou
N. Hovakimyan
MedIm
399
35
0
26 Mar 2024
DreamLIP: Language-Image Pre-training with Long Captions
DreamLIP: Language-Image Pre-training with Long Captions
Kecheng Zheng
Yifei Zhang
Wei Wu
Fan Lu
Shuailei Ma
Xin Jin
Wei Chen
Yujun Shen
VLMCLIP
307
62
0
25 Mar 2024
Open-Set Recognition in the Age of Vision-Language Models
Open-Set Recognition in the Age of Vision-Language Models
Dimity Miller
Niko Sünderhauf
Alex Kenna
Keita Mason
VLM
251
10
0
25 Mar 2024
Determined Multi-Label Learning via Similarity-Based Prompt
Determined Multi-Label Learning via Similarity-Based Prompt
Meng Wei
Zhongnian Li
Peng Ying
Yong Zhou
Xinzheng Xu
217
0
0
25 Mar 2024
Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval
Knowledge-Enhanced Dual-stream Zero-shot Composed Image RetrievalComputer Vision and Pattern Recognition (CVPR), 2024
Yuchen Suo
Fan Ma
Linchao Zhu
Yi Yang
241
42
0
24 Mar 2024
InternVideo2: Scaling Video Foundation Models for Multimodal Video
  Understanding
InternVideo2: Scaling Video Foundation Models for Multimodal Video UnderstandingEuropean Conference on Computer Vision (ECCV), 2024
Yi Wang
Kunchang Li
Xinhao Li
Jiashuo Yu
Yinan He
...
Hongjie Zhang
Yifei Huang
Yu Qiao
Yali Wang
Limin Wang
261
104
0
22 Mar 2024
VidLA: Video-Language Alignment at Scale
VidLA: Video-Language Alignment at ScaleComputer Vision and Pattern Recognition (CVPR), 2024
Mamshad Nayeem Rizve
Fan Fei
Jayakrishnan Unnikrishnan
Son Tran
Benjamin Z. Yao
Belinda Zeng
Mubarak Shah
Trishul Chilimbi
VLMAI4TS
224
8
0
21 Mar 2024
Few-Shot Adversarial Prompt Learning on Vision-Language Models
Few-Shot Adversarial Prompt Learning on Vision-Language Models
Yiwei Zhou
Xiaobo Xia
Zhiwei Lin
Bo Han
Tongliang Liu
VLM
209
30
0
21 Mar 2024
MyVLM: Personalizing VLMs for User-Specific Queries
MyVLM: Personalizing VLMs for User-Specific Queries
Yuval Alaluf
Elad Richardson
Sergey Tulyakov
Kfir Aberman
Daniel Cohen-Or
MLLMVLM
309
42
0
21 Mar 2024
Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship
  Detection
Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
Tim Salzmann
Markus Ryll
Alex Bewley
Matthias Minderer
281
7
0
21 Mar 2024
Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding
Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding
Jingjing Hu
Dan Guo
Kun Li
Zhan Si
Xun Yang
Xiaojun Chang
Meng Wang
301
20
0
21 Mar 2024
MTP: Advancing Remote Sensing Foundation Model via Multi-Task
  Pretraining
MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining
Di Wang
Jing Zhang
Minqiang Xu
Lin Liu
Dongsheng Wang
...
Chengxi Han
Haonan Guo
Bo Du
Dacheng Tao
Guang Dai
235
100
0
20 Mar 2024
Vid2Robot: End-to-end Video-conditioned Policy Learning with
  Cross-Attention Transformers
Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
Vidhi Jain
Maria Attarian
Nikhil J. Joshi
Ayzaan Wahid
Danny Driess
...
Stefan Welker
Christine Chan
Igor Gilitschenski
Yonatan Bisk
Debidatta Dwibedi
326
48
0
19 Mar 2024
Dynamic Tuning Towards Parameter and Inference Efficiency for ViT
  Adaptation
Dynamic Tuning Towards Parameter and Inference Efficiency for ViT AdaptationNeural Information Processing Systems (NeurIPS), 2024
Wangbo Zhao
Jiasheng Tang
Yizeng Han
Yibing Song
Kai Wang
Gao Huang
F. Wang
Yang You
323
23
0
18 Mar 2024
EffiVED:Efficient Video Editing via Text-instruction Diffusion Models
EffiVED:Efficient Video Editing via Text-instruction Diffusion Models
Zhenghao Zhang
Zuozhuo Dai
Long Qin
Weizhi Wang
DiffMVGen
262
11
0
18 Mar 2024
Generative Region-Language Pretraining for Open-Ended Object Detection
Generative Region-Language Pretraining for Open-Ended Object DetectionComputer Vision and Pattern Recognition (CVPR), 2024
Chuang Lin
Yi Jiang
Zhuang Li
Zehuan Yuan
Jianfei Cai
ObjDVLM
222
27
0
15 Mar 2024
RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training
RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-trainingIEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2024
Zhixiu Lu
Hailong Li
N. Parikh
Jonathan R. Dillman
Lili He
MedImVLM
464
7
0
15 Mar 2024
XCoOp: Explainable Prompt Learning for Computer-Aided Diagnosis via
  Concept-guided Context Optimization
XCoOp: Explainable Prompt Learning for Computer-Aided Diagnosis via Concept-guided Context OptimizationInternational Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2024
Yequan Bie
Luyang Luo
Zhixuan Chen
Hao Chen
191
19
0
14 Mar 2024
Can We Talk Models Into Seeing the World Differently?
Can We Talk Models Into Seeing the World Differently?International Conference on Learning Representations (ICLR), 2024
Paul Gavrikov
Jovita Lukasik
Steffen Jung
Robert Geirhos
Bianca Lamm
Muhammad Jehanzeb Mirza
Margret Keuper
VLM
251
4
0
14 Mar 2024
Decomposing Disease Descriptions for Enhanced Pathology Detection: A
  Multi-Aspect Vision-Language Pre-training Framework
Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training FrameworkComputer Vision and Pattern Recognition (CVPR), 2024
Vu Minh Hieu Phan
Yutong Xie
Yuankai Qi
Lingqiao Liu
Liyang Liu
Bowen Zhang
Zhibin Liao
Qi Wu
Minh-Son To
Johan Verjans
362
28
0
12 Mar 2024
Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized
  Visual Class Discovery
Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class DiscoveryEuropean Conference on Computer Vision (ECCV), 2024
Haiyang Zheng
Nan Pu
Wenjing Li
Andrii Zadaianchuk
Zhun Zhong
252
15
0
12 Mar 2024
QUASAR: QUality and Aesthetics Scoring with Advanced Representations
QUASAR: QUality and Aesthetics Scoring with Advanced RepresentationsIEEE Access (IEEE Access), 2024
Sergey Kastryulin
Denis Prokopenko
Artem Babenko
Dmitry V. Dylov
257
0
0
11 Mar 2024
RESTORE: Towards Feature Shift for Vision-Language Prompt Learning
RESTORE: Towards Feature Shift for Vision-Language Prompt Learning
Yuncheng Yang
Chuyan Zhang
Zuopeng Yang
Yuting Gao
Yulei Qin
Ke Li
Xing Sun
Jie Yang
Yun Gu
VLMVPVLM
325
0
0
10 Mar 2024
CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?
CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?International Conference on Learning Representations (ICLR), 2024
Ibrahim Alabdulmohsin
Xiao Wang
Andreas Steiner
Priya Goyal
Alexander DÁmour
Xiao-Qi Zhai
211
30
0
07 Mar 2024
Modeling Collaborator: Enabling Subjective Vision Classification With
  Minimal Human Effort via LLM Tool-Use
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
Imad Eddine Toubal
Aditya Avinash
N. Alldrin
Jan Dlabal
Wenlei Zhou
...
Chun-Ta Lu
Howard Zhou
Ranjay Krishna
Ariel Fuxman
Tom Duerig
VLM
321
20
0
05 Mar 2024
HeAR -- Health Acoustic Representations
HeAR -- Health Acoustic Representations
Sebastien Baur
Zaid Nabulsi
Wei-Hung Weng
Jake Garrison
Louis Blankemeier
...
Shwetak N. Patel
S. Shetty
Shruthi Prabhakara
Monde Muyoyeta
Diego Ardila
LM&MA
237
26
0
04 Mar 2024
Differentially Private Representation Learning via Image Captioning
Differentially Private Representation Learning via Image Captioning
Tom Sander
Yaodong Yu
Maziar Sanjabi
Alain Durmus
Yi-An Ma
Kamalika Chaudhuri
Chuan Guo
276
7
0
04 Mar 2024
Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary
  Action Recognition
Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition
Kun-Yu Lin
Henghui Ding
Jiaming Zhou
Yu-Ming Tang
Yi-Xing Peng
Zhilin Zhao
Chen Change Loy
Wei-Shi Zheng
VLM
344
21
0
03 Mar 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen
Aliaksandr Siarohin
Willi Menapace
Ekaterina Deyneka
Hsiang-wei Chao
...
Yuwei Fang
Hsin-Ying Lee
Jian Ren
Ming-Hsuan Yang
Sergey Tulyakov
VGen
369
342
0
29 Feb 2024
The All-Seeing Project V2: Towards General Relation Comprehension of the
  Open World
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang
Yiming Ren
Hao Luo
Tiantong Li
Chenxiang Yan
...
Qingyun Li
Lewei Lu
Xizhou Zhu
Yu Qiao
Jifeng Dai
MLLM
318
85
0
29 Feb 2024
SeD: Semantic-Aware Discriminator for Image Super-Resolution
SeD: Semantic-Aware Discriminator for Image Super-Resolution
Bingchen Li
Xin Li
Hanxin Zhu
Yeying Jin
Ruoyu Feng
Zhizheng Zhang
Zhibo Chen
SupR
219
42
0
29 Feb 2024
Deep Learning for Cross-Domain Data Fusion in Urban Computing: Taxonomy,
  Advances, and Outlook
Deep Learning for Cross-Domain Data Fusion in Urban Computing: Taxonomy, Advances, and Outlook
Xingchen Zou
Yibo Yan
Xixuan Hao
Yuehong Hu
Haomin Wen
...
Junbo Zhang
Yong Li
Tianrui Li
Yu Zheng
Yuxuan Liang
HAIAI4TS
324
84
0
29 Feb 2024
Generalizable Whole Slide Image Classification with Fine-Grained
  Visual-Semantic Interaction
Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction
Hao Li
Ying Chen
Yifei Chen
Wenxian Yang
Bowen Ding
Yuchen Han
Liansheng Wang
Rongshan Yu
269
31
0
29 Feb 2024
MMSR: Symbolic Regression is a Multimodal Task
MMSR: Symbolic Regression is a Multimodal Task
Yanjie Li
Jingyi Liu
Weijun Li
Lina Yu
Min Wu
Wenqiang Li
Meilan Hao
Su Wei
Yusong Deng
234
4
0
28 Feb 2024
Sora: A Review on Background, Technology, Limitations, and Opportunities
  of Large Vision Models
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Yixin Liu
Kai Zhang
Yuan Li
Zhiling Yan
Chujie Gao
...
Yue Huang
Hanchi Sun
Jianfeng Gao
Lifang He
Lichao Sun
VLMVGenEGVM
424
493
0
27 Feb 2024
Previous
123...8910...192021
Next