ResearchTrend.AI
arXiv: 2205.01917
CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
    VLM
    CLIP
    OffRL

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 910 papers shown
TransformerFAM: Feedback attention is working memory
Dongseong Hwang
Weiran Wang
Zhuoyuan Huo
K. Sim
P. M. Mengibar
27
12
0
14 Apr 2024
AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning
Yuwei Tang
Zhenyi Lin
Qilong Wang
Pengfei Zhu
Qinghua Hu
26
11
0
13 Apr 2024
ChimpVLM: Ethogram-Enhanced Chimpanzee Behaviour Recognition
Otto Brookes
Majid Mirmehdi
H. Kühl
T. Burghardt
24
3
0
13 Apr 2024
PM2: A New Prompting Multi-modal Model Paradigm for Few-shot Medical Image Classification
Zhenwei Wang
Qiule Sun
Bingbing Zhang
Pengfei Wang
Jianxin Zhang
Qiang Zhang
VLM
38
1
0
13 Apr 2024
COCONut: Modernizing COCO Segmentation
XueQing Deng
Qihang Yu
Peng Wang
Xiaohui Shen
Liang-Chieh Chen
32
16
0
12 Apr 2024
Improving Continuous Sign Language Recognition with Adapted Image Models
Lianyu Hu
Tongkai Shi
Liqing Gao
Zekang Liu
Wei Feng
VLM
18
5
0
12 Apr 2024
Generalized Contrastive Learning for Multi-Modal Retrieval and Ranking
Tianyu Zhu
M. Jung
Jesse Clark
83
1
0
12 Apr 2024
Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models
Simon Schrodi
David T. Hoffmann
Max Argus
Volker Fischer
Thomas Brox
VLM
50
0
0
11 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM
VLM
42
25
0
10 Apr 2024
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models
David Kurzendörfer
Otniel-Bogdan Mercea
A. Sophia Koepke
Zeynep Akata
VLM
CLIP
14
2
0
09 Apr 2024
Test-Time Zero-Shot Temporal Action Localization
Benedetta Liberatori
Alessandro Conti
Paolo Rota
Yiming Wang
Elisa Ricci
19
3
0
08 Apr 2024
Hyperbolic Learning with Synthetic Captions for Open-World Detection
Fanjie Kong
Yanbei Chen
Jiarui Cai
Davide Modolo
VLM
ObjD
25
7
0
07 Apr 2024
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Dongzhi Jiang
Guanglu Song
Xiaoshi Wu
Renrui Zhang
Dazhong Shen
Zhuofan Zong
Yu Liu
Hongsheng Li
VLM
30
20
0
04 Apr 2024
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
Vishaal Udandarao
Ameya Prabhu
Adhiraj Ghosh
Yash Sharma
Philip H. S. Torr
Adel Bibi
Samuel Albanie
Matthias Bethge
VLM
118
43
0
04 Apr 2024
Foundation Model for Advancing Healthcare: Challenges, Opportunities, and Future Directions
Yuting He
Fuxiang Huang
Xinrui Jiang
Yuxiang Nie
Minghao Wang
Jiguang Wang
Hao Chen
LM&MA
AI4CE
71
26
0
04 Apr 2024
LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity
Walid Bousselham
Angie Boggust
Sofian Chaybouti
Hendrik Strobelt
Hilde Kuehne
83
10
0
04 Apr 2024
SalFoM: Dynamic Saliency Prediction with Video Foundation Models
Morteza Moradi
Mohammad Moradi
Francesco Rundo
C. Spampinato
Ali Borji
S. Palazzo
30
1
0
03 Apr 2024
Segment Any 3D Object with Language
Seungjun Lee
Yuyang Zhao
Gim Hee Lee
31
1
0
02 Apr 2024
Iterated Learning Improves Compositionality in Large Vision-Language Models
Chenhao Zheng
Jieyu Zhang
Aniruddha Kembhavi
Ranjay Krishna
VLM
CoGe
41
9
0
02 Apr 2024
ViTamin: Designing Scalable Vision Models in the Vision-Language Era
Jieneng Chen
Qihang Yu
Xiaohui Shen
Alan L. Yuille
Liang-Chieh Chen
3DV
VLM
28
24
0
02 Apr 2024
Fashion Style Editing with Generative Human Prior
Chaerin Kong
Seungyong Lee
Soohyeok Im
Wonsuk Yang
41
0
0
02 Apr 2024
VLRM: Vision-Language Models act as Reward Models for Image Captioning
Maksim Dzabraev
Alexander Kunitsyn
Andrei Ivaniuta
VLM
MLLM
21
3
0
02 Apr 2024
Streaming Dense Video Captioning
Xingyi Zhou
Anurag Arnab
Shyamal Buch
Shen Yan
Austin Myers
Xuehan Xiong
Arsha Nagrani
Cordelia Schmid
VLM
29
30
0
01 Apr 2024
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Agneet Chatterjee
Gabriela Ben-Melech Stan
Estelle Aflalo
Sayak Paul
Dhruba Ghosh
...
Ludwig Schmidt
Hanna Hajishirzi
Vasudev Lal
Chitta Baral
Yezhou Yang
EGVM
VLM
57
14
0
01 Apr 2024
GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields
Yunsong Wang
Hanlin Chen
Gim Hee Lee
27
5
0
01 Apr 2024
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Kai Zhang
Yi Luan
Hexiang Hu
Kenton Lee
Siyuan Qiao
Wenhu Chen
Yu-Chuan Su
Ming-Wei Chang
VLM
LRM
31
32
0
28 Mar 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
37
5
0
28 Mar 2024
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim M. Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiaohua Zhai
VLM
40
5
0
28 Mar 2024
CLAP4CLIP: Continual Learning with Probabilistic Finetuning for Vision-Language Models
Saurav Jha
Dong Gong
Lina Yao
CLIP
VLM
33
7
0
28 Mar 2024
Language Plays a Pivotal Role in the Object-Attribute Compositional Generalization of CLIP
Reza Abbasi
Mohammad Samiei
M. Rohban
M. Baghshah
VLM
CoGe
20
0
0
27 Mar 2024
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
Wonkyun Kim
Changin Choi
Wonseok Lee
Wonjong Rhee
VLM
43
50
0
27 Mar 2024
Residual-based Language Models are Free Boosters for Biomedical Imaging
Zhixin Lai
Jing Wu
Suiyao Chen
Yucheng Zhou
N. Hovakimyan
MedIm
27
26
0
26 Mar 2024
DreamLIP: Language-Image Pre-training with Long Captions
Kecheng Zheng
Yifei Zhang
Wei Wu
Fan Lu
Shuailei Ma
Xin Jin
Wei Chen
Yujun Shen
VLM
CLIP
32
23
0
25 Mar 2024
Open-Set Recognition in the Age of Vision-Language Models
Dimity Miller
Niko Sünderhauf
Alex Kenna
Keita Mason
VLM
30
3
0
25 Mar 2024
Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval
Yuchen Suo
Fan Ma
Linchao Zhu
Yi Yang
27
18
0
24 Mar 2024
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Yi Wang
Kunchang Li
Xinhao Li
Jiashuo Yu
Yinan He
...
Hongjie Zhang
Yifei Huang
Yu Qiao
Yali Wang
Limin Wang
27
44
0
22 Mar 2024
VidLA: Video-Language Alignment at Scale
Mamshad Nayeem Rizve
Fan Fei
Jayakrishnan Unnikrishnan
Son Tran
Benjamin Z. Yao
Belinda Zeng
Mubarak Shah
Trishul M. Chilimbi
VLM
AI4TS
43
4
0
21 Mar 2024
Few-Shot Adversarial Prompt Learning on Vision-Language Models
Yiwei Zhou
Xiaobo Xia
Zhiwei Lin
Bo Han
Tongliang Liu
VLM
34
10
0
21 Mar 2024
MyVLM: Personalizing VLMs for User-Specific Queries
Yuval Alaluf
Elad Richardson
Sergey Tulyakov
Kfir Aberman
Daniel Cohen-Or
MLLM
VLM
36
18
0
21 Mar 2024
Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
Tim Salzmann
Markus Ryll
Alex Bewley
Matthias Minderer
38
4
0
21 Mar 2024
Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding
Jingjing Hu
Dan Guo
Kun Li
Zhan Si
Xun Yang
Xiaojun Chang
Meng Wang
59
3
0
21 Mar 2024
MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining
Di Wang
Jing Zhang
Minqiang Xu
Lin Liu
Dongsheng Wang
...
Chengxi Han
Haonan Guo
Bo Du
Dacheng Tao
L. Zhang
29
42
0
20 Mar 2024
Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
Vidhi Jain
Maria Attarian
Nikhil J. Joshi
Ayzaan Wahid
Danny Driess
...
Stefan Welker
Christine Chan
Igor Gilitschenski
Yonatan Bisk
Debidatta Dwibedi
68
27
0
19 Mar 2024
Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation
Wangbo Zhao
Jiasheng Tang
Yizeng Han
Yibing Song
Kai Wang
Gao Huang
F. Wang
Yang You
35
11
0
18 Mar 2024
EffiVED: Efficient Video Editing via Text-instruction Diffusion Models
Zhenghao Zhang
Zuozhuo Dai
Long Qin
Weizhi Wang
DiffM
VGen
34
2
0
18 Mar 2024
Generative Region-Language Pretraining for Open-Ended Object Detection
Chuang Lin
Yi-Xin Jiang
Lizhen Qu
Zehuan Yuan
Jianfei Cai
ObjD
VLM
38
13
0
15 Mar 2024
RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training
Zhixiu Lu
Hailong Li
Lili He
VLM
MedIm
27
0
0
15 Mar 2024
XCoOp: Explainable Prompt Learning for Computer-Aided Diagnosis via Concept-guided Context Optimization
Yequan Bie
Luyang Luo
Zhixuan Chen
Hao Chen
34
7
0
14 Mar 2024
Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework
Vu Minh Hieu Phan
Yutong Xie
Yuankai Qi
Lingqiao Liu
Liyang Liu
Bowen Zhang
Zhibin Liao
Qi Wu
Minh Nguyen Nhat To
Johan W. Verjans
48
11
0
12 Mar 2024
Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery
Haiyang Zheng
Nan Pu
Wenjing Li
N. Sebe
Zhun Zhong
39
7
0
12 Mar 2024