CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM, CLIP, OffRL
arXiv:2205.01917 (HuggingFace: 3 upvotes)

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 1,042 papers shown
Deep Correlated Prompting for Visual Recognition with Missing Modalities
Neural Information Processing Systems (NeurIPS), 2024
Lianyu Hu
Tongkai Shi
Wei Feng
Fanhua Shang
Liang Wan
VLM
462
12
0
09 Oct 2024
TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models
Asian Conference on Computer Vision (ACCV), 2024
Rabin Adhikari
Safal Thapaliya
Manish Dhakal
Bishesh Khanal
MLLM, VLM
286
2
0
07 Oct 2024
Uncertainty-Guided Enhancement on Driving Perception System via Foundation Models
IEEE International Conference on Robotics and Automation (ICRA), 2024
Yunhao Yang
Yuxin Hu
Mao Ye
Zaiwei Zhang
Zhichao Lu
Yi Xu
Ufuk Topcu
Ben Snyder
252
4
0
02 Oct 2024
Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity
Hanqi Jiang
Xixuan Hao
Yuzhou Huang
Chong Ma
Jiaxun Zhang
Yi Pan
Ruimao Zhang
MedIm
395
2
0
01 Oct 2024
Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation
Neural Information Processing Systems (NeurIPS), 2024
Kun Yuan
V. Srivastav
Nassir Navab
N. Padoy
404
24
0
30 Sep 2024
FAST: A Dual-tier Few-Shot Learning Paradigm for Whole Slide Image Classification
Neural Information Processing Systems (NeurIPS), 2024
Kexue Fu
Xiaoyuan Luo
Linhao Qu
Shuo Wang
Ying Xiong
Ilias Maglogiannis
Longxiang Gao
Manning Wang
198
6
0
29 Sep 2024
Vision-Language Models are Strong Noisy Label Detectors
Neural Information Processing Systems (NeurIPS), 2024
Tong Wei
Haoyang Li
Chun-Shu Li
Jiang-Xin Shi
Yu-Feng Li
Min-Ling Zhang
VLM
214
14
0
29 Sep 2024
From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation
International Conference on Machine Learning (ICML), 2024
Kun Su
Xiulong Liu
Eli Shlizerman
VGen
416
15
0
27 Sep 2024
ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features
Xin Wei
Yaling Tao
Changde Du
Gangming Zhao
Yizhou Yu
Jinpeng Li
218
0
0
24 Sep 2024
LARE: Latent Augmentation using Regional Embedding with Vision-Language Model
Machine Learning with Applications (MLWA), 2024
Kosuke Sakurai
Tatsuya Ishii
Ryotaro Shimizu
Linxin Song
Masayuki Goto
VLM
247
1
0
19 Sep 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
422
5
0
19 Sep 2024
MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Kalakonda Sai Shashank
Shubh Maheshwari
Ravi Kiran Sarvadevabhatla
VGen, DiffM
293
6
0
18 Sep 2024
Evaluating Pre-trained Convolutional Neural Networks and Foundation Models as Feature Extractors for Content-based Medical Image Retrieval
Engineering Applications of Artificial Intelligence (EAAI), 2024
Amirreza Mahbod
Nematollah Saeidi
Sepideh Hatamikia
Ramona Woitek
VLM, MedIm
343
13
0
14 Sep 2024
Phikon-v2, A large and public feature extractor for biomarker prediction
Alexandre Filiot
Paul Jacob
Alice Mac Kain
Charlie Saillard
MedIm
256
64
0
13 Sep 2024
ComAlign: Compositional Alignment in Vision-Language Models
Ali Abdollah
Amirmohammad Izadi
Armin Saghafian
Reza Vahidimajd
Mohammad Mozafari
Amirreza Mirzaei
Mohammadmahdi Samiei
M. Baghshah
CoGe, VLM
210
1
0
12 Sep 2024
Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective
Guimin Hu
Yi Xin
Weimin Lyu
Haojian Huang
Chang Sun
Zehan Zhu
Lin Gui
Ruichu Cai
Erik Cambria
Hasti Seifi
375
15
0
11 Sep 2024
Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
Yujie Wang
Shenhan Zhu
Fangcheng Fu
Xupeng Miao
Jie Zhang
Juan Zhu
Fan Hong
Yongbin Li
Bin Cui
160
0
0
05 Sep 2024
CanvOI, an Oncology Intelligence Foundation Model: Scaling FLOPS Differently
Jonathan Zalach
Inbal Gazy
Assaf Avinoam
Ron Sinai
Eran Shmuel
Inbar Gilboa
Christine Swisher
Naim Matasci
Reva Basho
David B. Agus
182
0
0
04 Sep 2024
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur
Darshan Singh
Makarand Tapaswi
939
2
0
04 Sep 2024
Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models
K. Nakata
Daisuke Miyashita
Youyang Ng
Yasuto Hoshi
J. Deguchi
163
1
0
29 Aug 2024
RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models
ISPRS Journal of Photogrammetry and Remote Sensing (ISPRS J. Photogramm. Remote Sens.), 2024
Junyao Ge
Xu Zhang
Yang Zheng
Kaitai Guo
Jimin Liang
620
6
0
27 Aug 2024
A New Era in Computational Pathology: A Survey on Foundation and Vision-Language Models
Dibaloke Chanda
Milan Aryal
Nasim Yahya Soltani
Masoud Ganji
AI4CE, VLM
428
11
0
23 Aug 2024
Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive Survey
Information Fusion (Inf. Fusion), 2024
Qika Lin
Yifan Zhu
Xin Mei
Ling Huang
Jingying Ma
Kai He
Zhen Peng
Xiaoshi Zhong
Mengling Feng
293
62
0
23 Aug 2024
XDT-CXR: Investigating Cross-Disease Transferability in Zero-Shot Binary Classification of Chest X-Rays
Machine Learning in Health Care (MLHC), 2024
Umaima Rahman
Abhishek Basu
Muhammad Uzair Khattak
Aniq Ur Rahman
MedIm
213
0
0
21 Aug 2024
WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared Person Re-Identification
European Conference on Computer Vision (ECCV), 2024
Yonggan Wu
Ling-Chao Meng
Yuan Zichao
Sixian Chan
Hong-Qiang Wang
263
8
0
20 Aug 2024
C$^2$RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval
Zhigang Chen
Benjia Zhou
Yiqing Huang
Jun Wan
Yibo Hu
Hailin Shi
Yanyan Liang
Zhen Lei
Du Zhang
VLM, SLR
197
11
0
19 Aug 2024
NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality
Chaofan Tao
Gukyeong Kwon
Varad Gunjal
Hao Yang
Zhaowei Cai
Yonatan Dukler
Ashwin Swaminathan
R. Manmatha
Colin Jon Taylor
Stefano Soatto
CoGe
194
0
0
18 Aug 2024
CROME: Cross-Modal Adapters for Efficient Multimodal LLM
Sayna Ebrahimi
Sercan O. Arik
Tejas Nama
Tomas Pfister
188
4
0
13 Aug 2024
Contrastive masked auto-encoders based self-supervised hashing for 2D image and 3D point cloud cross-modal retrieval
IEEE International Conference on Multimedia and Expo (ICME), 2024
Rukai Wei
Heng Cui
Yu Liu
Yufeng Hou
Yanzhao Xie
Ke Zhou
3DPC
188
0
0
11 Aug 2024
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation
European Conference on Computer Vision (ECCV), 2024
Dahyun Kang
Minsu Cho
ObjD, VLM
390
24
0
09 Aug 2024
UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
Neural Information Processing Systems (NeurIPS), 2024
Haider Al-Tahan
Q. Garrido
Randall Balestriero
Diane Bouchacourt
C. Hazirbas
Mark Ibrahim
VLM
284
23
0
09 Aug 2024
ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling
European Conference on Computer Vision (ECCV), 2024
William Y. Zhu
Keren Ye
Junjie Ke
Jiahui Yu
Leonidas Guibas
P. Milanfar
Feng Yang
341
2
0
07 Aug 2024
Multistain Pretraining for Slide Representation Learning in Pathology
European Conference on Computer Vision (ECCV), 2024
Guillaume Jaume
Anurag J. Vaidya
Andrew Zhang
Andrew H. Song
Richard J. Chen
S. Sahai
Dandan Mo
Emilio Madrigal
L. Le
Faisal Mahmood
236
25
0
05 Aug 2024
Text-Guided Video Masked Autoencoder
European Conference on Computer Vision (ECCV), 2024
D. Fan
Jue Wang
Shuai Liao
Zhikang Zhang
Vimal Bhat
Xinyu Li
VGen
167
7
0
01 Aug 2024
Conditioned Prompt-Optimization for Continual Deepfake Detection
Francesco Laiti
Benedetta Liberatori
Thomas De Min
Elisa Ricci
318
7
0
31 Jul 2024
GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models
Ali Abdollahi
Mahdi Ghaznavi
Mohammad Reza Karimi Nejad
Arash Mari Oriyad
Reza Abbasi
Ali Salesi
Melika Behjati
M. Rohban
M. Baghshah
CoGe
401
3
0
30 Jul 2024
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
Yatian Wang
Yatian Wang
Aosong Cheng
Pengjun Fang
Zeyue Tian
...
Wenhan Luo
Qifeng Chen
Shanghang Zhang
Qi-fei Liu
Yi-Ting Guo
301
8
0
30 Jul 2024
Look Hear: Gaze Prediction for Speech-directed Human Attention
European Conference on Computer Vision (ECCV), 2024
Sounak Mondal
Seoyoung Ahn
Zhibo Yang
Niranjan Balasubramanian
Dimitris Samaras
G. Zelinsky
Minh Hoai
409
3
0
28 Jul 2024
MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training
Biao Wu
Yutong Xie
Zeyu Zhang
Minh Hieu Phan
Qi Chen
Ling-Hao Chen
Qi Wu
LM&MA
237
9
0
28 Jul 2024
Unified Lexical Representation for Interpretable Visual-Language Alignment
Yifan Li
Yikai Wang
Yanwei Fu
Dongyu Ru
Zheng Zhang
Tong He
VLM
211
7
0
25 Jul 2024
QPT V2: Masked Image Modeling Advances Visual Scoring
Qizhi Xie
Kun Yuan
Yunpeng Qu
Mingda Wu
Ming Sun
Chao Zhou
Jihong Zhu
236
6
0
23 Jul 2024
Improved Few-Shot Image Classification Through Multiple-Choice Questions
Dipika Khullar
Emmett Goodman
Negin Sokhandan
151
2
0
23 Jul 2024
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
Yangzhou Liu
Yue Cao
Zhangwei Gao
Weiyun Wang
Zhe Chen
...
Lewei Lu
Xizhou Zhu
Tong Lu
Yu Qiao
Jifeng Dai
VLM, MLLM
313
41
0
22 Jul 2024
In-Context Learning Improves Compositional Understanding of Vision-Language Models
Matteo Nulli
Anesa Ibrahimi
Avik Pal
Hoshe Lee
Ivona Najdenkoska
VLM, CoGe
193
0
0
22 Jul 2024
The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models
Laura Niss
Kevin Vogt-Lowell
Theodoros Tsiligkaridis
VLM
306
1
0
22 Jul 2024
A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model
Yingxue Xu
Yihui Wang
Fengtao Zhou
Jiabo Ma
Shu Yang
...
Anjia Han
Ronald Cheong Kin Chan
Li Liang
Xiuming Zhang
Hao Chen
436
45
0
22 Jul 2024
Large-vocabulary forensic pathological analyses via prototypical cross-modal contrastive learning
Chen Shen
Chunfeng Lian
Wanqing Zhang
Fan Wang
Jianhua Zhang
...
Hongshu Mu
Hao Wu
Xinggong Liang
Jianhua Ma
Zhenyuan Wang
209
5
0
20 Jul 2024
Multimodal Label Relevance Ranking via Reinforcement Learning
Taian Guo
Taolin Zhang
Haoqian Wu
Hanjun Li
Ruizhi Qiao
Xing Sun
OffRL
189
1
0
18 Jul 2024
ViLLa: Video Reasoning Segmentation with Large Language Model
Rongkun Zheng
Lu Qi
Xi Chen
Yi Wang
Kun Wang
Yu Qiao
Hengshuang Zhao
VOS, LRM
507
16
0
18 Jul 2024
ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
Mengcheng Lan
Chaofeng Chen
Yiping Ke
Xinjiang Wang
Xue Jiang
Wayne Zhang
VLM
332
68
0
17 Jul 2024