CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM, CLIP, OffRL

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 1,042 papers shown
Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation
Xieji Li
Siyuan Yan
Yingsheng Liu
H. Soyer
Monika Janda
Victoria Mar
Zongyuan Ge
MedIm
280
0
0
03 Dec 2025
PowerCLIP: Powerset Alignment for Contrastive Pre-Training
Masaki Kawamura
Nakamasa Inoue
Rintaro Yanagi
Hirokatsu Kataoka
Rio Yokota
CLIP, VLM
149
0
0
28 Nov 2025
Scaling Foundation Models for Radar Scene Understanding
Pushkal Mishra
Kshitiz Bansal
Dinesh Bharadia
215
0
0
26 Nov 2025
Advancing Image Classification with Discrete Diffusion Classification Modeling
Omer Belhasin
Shelly Golan
Ran El-Yaniv
Michael Elad
DiffM
198
0
0
25 Nov 2025
ReMatch: Boosting Representation through Matching for Multimodal Retrieval
Qianying Liu
Xiao Liang
Zhiqiang Zhang
Zhongfei Qing
Fengfan Zhou
Y. Chen
Xu Tang
Yao Hu
Paul Henderson
207
0
0
24 Nov 2025
BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks
Samuel Stevens
153
0
0
20 Nov 2025
From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models
Wenxin Zhu
Andong Chen
Yuchen Song
Kehai Chen
Conghui Zhu
Ziyan Chen
Tiejun Zhao
LRM
434
0
0
17 Nov 2025
MergeSlide: Continual Model Merging and Task-to-Class Prompt-Aligned Inference for Lifelong Learning on Whole Slide Images
Doanh C. Bui
Ba-Hung Ngo
H. Pham
Khang Phuoc-Quy Nguyen
Maï K. Nguyen
Y. Nakashima
CLL, MoMe, VLM
294
0
0
17 Nov 2025
Uni-Hema: Unified Model for Digital Hematopathology
Abdul Rehman
Iqra Rasool
Ayisha Imran
Mohsen Ali
Waqas Sultani
VLM
140
0
0
17 Nov 2025
Medical Knowledge Intervention Prompt Tuning for Medical Image Classification
IEEE Transactions on Medical Imaging (IEEE TMI), 2025
Ye Du
Nanxi Yu
Shujun Wang
LM&MA, VLM
192
1
0
16 Nov 2025
From Classification to Cross-Modal Understanding: Leveraging Vision-Language Models for Fine-Grained Renal Pathology
Zhenhao Guo
Rachit Saluja
Tianyuan Yao
Quan Liu
Junchao Zhu
...
Steven Salvatoree
Surya Seshane
M. Sabuncu
Yihe Yang
Ruining Deng
VLM
120
0
0
15 Nov 2025
Learning with Preserving for Continual Multitask Learning
H. Wang
Siwoo Bae
Zirong Chen
Meiyi Ma
CLL
172
0
0
11 Nov 2025
SWAP: Towards Copyright Auditing of Soft Prompts via Sequential Watermarking
Wenyuan Yang
Yichen Sun
Changzheng Chen
Zhixuan Chu
Jiaheng Zhang
Yiming Li
Dacheng Tao
AAML
100
0
0
05 Nov 2025
Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models
Tianfan Peng
Yuntao Du
Pengzhou Ji
Shijie Dong
Kailin Jiang
...
Jinhe Bi
Qian Li
Wei Du
Feng Xiao
Lizhen Cui
VLM
260
0
0
04 Nov 2025
SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment
Wenbo Lu
CLIP, VLM
193
0
0
04 Nov 2025
FOCUS: Efficient Keyframe Selection for Long Video Understanding
Zirui Zhu
Hailun Xu
Yang Luo
Yong Liu
Kanchan Sarkar
Zhenheng Yang
Yang You
148
0
0
31 Oct 2025
MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models
Zimeng Huang
Jinxin Ke
Xiaoxuan Fan
Yufeng Yang
Yang Liu
...
Junteng Dai
Haoyi Jiang
Y. Zhou
Keze Wang
Z. Chen
LRM, VLM
323
0
0
30 Oct 2025
Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual
Sukrit Sriratanawilai
Jhayahgrit Thongwat
Romrawin Chumpu
Patomporn Payoungkhamdee
Sarana Nutanong
Peerat Limkonchotiwat
VLM
142
0
0
30 Oct 2025
[De|Re]constructing VLMs' Reasoning in Counting
Simone Alghisi
Gabriel Roccabruna
Massimo Rizzoli
Seyed Mahed Mousavi
Giuseppe Riccardi
ReLM, LRM, VLM
198
1
0
22 Oct 2025
A Matter of Time: Revealing the Structure of Time in Vision-Language Models
Nidham Tekaya
Manuela Waldner
Matthias Zeppelzauer
VLM
108
0
0
22 Oct 2025
Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents
Yiqi Lin
Alex Jinpeng Wang
Linjie Li
Zhengyuan Yang
Mike Zheng Shou
124
0
0
21 Oct 2025
CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder
Yongmin Lee
Hye Won Chung
141
0
0
21 Oct 2025
Calibrated Principal Component Regression
Yixuan Florence Wu
Yilun Zhu
Lei Cao
Naichen Shi
86
0
0
21 Oct 2025
Enhancing Compositional Reasoning in CLIP via Reconstruction and Alignment of Text Descriptions
Jihoon Kwon
Kyle Min
Jy-yong Sohn
CoGe
168
0
0
18 Oct 2025
Self-Augmented Visual Contrastive Decoding
Eun Woo Im
M. K. Ali
Vivek Gupta
125
0
0
15 Oct 2025
End-to-End Multi-Modal Diffusion Mamba
Chunhao Lu
Qiang Lu
Meichen Dong
Jake Luo
126
3
0
15 Oct 2025
UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
Zhengrong Yue
H. Zhang
Xiangyu Zeng
Boyu Chen
Chenting Wang
...
Lu Dong
Kunpeng Du
Yi Wang
Limin Wang
Yali Wang
176
7
0
12 Oct 2025
From Generic to Specialized: A Subspecialty Diagnostic System Powered by Self-Supervised Learning for Cervical Histopathology
Yizhi Wang
Li Chen
Qiang Huang
Tian Guan
Xi Deng
...
Zhen Song
Xilong Zhao
Chao-Peng He
Ming Zhao
Yonghong He
100
0
0
11 Oct 2025
Vision Language Models: A Survey of 26K Papers
Fengming Lin
3DV, VLM
121
0
0
10 Oct 2025
Approximate Domain Unlearning for Vision-Language Models
Kodai Kawamura
Yuta Goto
Rintaro Yanagi
Hirokatsu Kataoka
Go Irie
MU, VLM
185
0
0
09 Oct 2025
Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking
Mitchell Keren Taraday
Shahaf Wagner
Chaim Baskin
VLM
105
1
0
08 Oct 2025
AgentDR Dynamic Recommendation with Implicit Item-Item Relations via LLM-based Agents
Mingdai Yang
Nurendra Choudhary
Jiangshu Du
Edward W. Huang
Philip S. Yu
Karthik Subbian
Danai Kourta
140
0
0
07 Oct 2025
Assessing Foundation Models for Mold Colony Detection with Limited Training Data
Henrik Pichler
Janis Keuper
Matthew Copping
79
0
0
01 Oct 2025
Are Time Series Foundation Models Susceptible to Catastrophic Forgetting?
Nouha Karaouli
Denis Coquenet
Elisa Fromont
Martial Mermillod
M. Reyboz
AI4TS, AAML, AI4CE
132
0
0
01 Oct 2025
Generalized Contrastive Learning for Universal Multimodal Retrieval
Jungsoo Lee
Janghoon Cho
Hyojin Park
Munawar Hayat
Kyuwoong Hwang
Fatih Porikli
Sungha Choi
VLM
180
2
0
30 Sep 2025
Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline
Haiyang Li
Yaxiong Wang
Lianwei Wu
Lechao Cheng
Zhun Zhong
182
1
0
30 Sep 2025
SupCLAP: Controlling Optimization Trajectory Drift in Audio-Text Contrastive Learning with Support Vector Regularization
Jiehui Luo
Yuguo Yin
Yuxin Xie
Jinghan Ru
Xianwei Zhuang
Minghua He
Aofan Liu
Zihan Xiong
Dongchao Yang
96
2
0
25 Sep 2025
Learning Contrastive Multimodal Fusion with Improved Modality Dropout for Disease Detection and Prediction
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2025
Yi Gu
Kuniaki Saito
Jiaxin Ma
124
0
0
22 Sep 2025
Vision-Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation
Weimin Bai
Yubo Li
Weijian Luo
Wenzheng Chen
He Sun
173
2
0
19 Sep 2025
Region-Aware Deformable Convolutions
Abolfazl Saheban Maleki
Maryam Imani
138
0
0
18 Sep 2025
Which Direction to Choose? An Analysis on the Representation Power of Self-Supervised ViTs in Downstream Tasks
Yannis Kaltampanidis
Alexandros Doumanoglou
D. Zarpalas
136
0
0
18 Sep 2025
Deep Learning-Driven Peptide Classification in Biological Nanopores
S. Tovey
Julian Hoßbach
Sandro Kuppel
Tobias Ensslen
Jan C. Behrends
Christian Holm
97
0
0
17 Sep 2025
AToken: A Unified Tokenizer for Vision
Jiasen Lu
Liangchen Song
Mingze Xu
Byeongjoo Ahn
Yanjun Wang
Chen Chen
Afshin Dehghan
Yinfei Yang
ViT
228
7
0
17 Sep 2025
Maps for Autonomous Driving: Full-process Survey and Frontiers
Pengxin Chen
Zhipeng Luo
Xiaoqi Jiang
Zhangcai Yin
Jonathan Li
128
0
0
16 Sep 2025
LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection
Benjamin Shiue-Hal Chou
Purvish Jajal
Nick Eliopoulos
James C. Davis
George K. Thiruvathukal
Kristen Yeon-Ji Yun
Yung-Hsiang Lu
132
0
0
16 Sep 2025
Synthetic Captions for Open-Vocabulary Zero-Shot Segmentation
Tim Lebailly
Vijay Veerabadran
Satwik Kottur
Karl Ridgeway
Michael L. Iuzzolino
VLM
91
0
0
15 Sep 2025
Robustifying Diffusion-Denoised Smoothing Against Covariate Shift
Ali Hedayatnia
Mostafa Tavassolipour
Babak N. Araabi
A. Vahabie
DiffM
101
0
0
13 Sep 2025
Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation
Yusuke Hirota
Ryo Hachiuma
Boyi Li
Ximing Lu
Michael Ross Boone
...
Marco Pavone
Yu-Chun Wang
Noa Garcia
Yuta Nakashima
Chao-Han Huck Yang
174
1
0
09 Sep 2025
Fine-Tuning Vision-Language Models for Visual Navigation Assistance
Xiao Li
Bharat Gandhi
Ming Zhan
Mohit Nehra
Zhicheng Zhang
Yuchen Sun
Meijia Song
Naisheng Zhang
Xi Wang
42
0
0
09 Sep 2025
Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding
Jiangnan Xie
Xiaolong Zheng
Liang Zheng
ObjD
169
0
0
08 Sep 2025