CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM, CLIP, OffRL

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 1,041 papers shown
Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning
Tianyi Bai
Yuxuan Fan
Jiantao Qiu
Fupeng Sun
Jiayi Song
Junlin Han
Zichen Liu
Conghui He
Wentao Zhang
Binhang Yuan
MLLM, VLM
210
2
0
08 Jun 2025
FREE: Fast and Robust Vision Language Models with Early Exits
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Divya J. Bajpai
M. Hanawal
VLM
133
2
0
07 Jun 2025
BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance
Huy Le
Nhat Chung
Tung Kieu
A. Nguyen
Ngan Le
364
1
0
04 Jun 2025
Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning
Amit Peleg
Naman D. Singh
Matthias Hein
CoGe, VLM
312
1
0
30 May 2025
A Mathematical Perspective On Contrastive Learning
Ricardo Baptista
Andrew Stuart
S. D. Tran
153
0
0
30 May 2025
From Theory to Application: Fine-Tuning Large EEG Model with Real-World Stress Data
Siwen Wang
Shitou Zhang
Wan-Lin Chen
Dung Truong
Tzyy-Ping Jung
126
1
0
29 May 2025
Revisiting Bayesian Model Averaging in the Era of Foundation Models
Mijung Park
UQCV, MoMe
181
0
0
28 May 2025
Vision Transformers with Self-Distilled Registers
Yinjie Chen
Zipeng Yan
Chong Zhou
Bo Dai
Andrew F. Luo
434
4
0
27 May 2025
VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval
Di Wu
Yixin Wan
Kai-Wei Chang
252
1
0
26 May 2025
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
Minheng Ni
Zhengyuan Yang
Linjie Li
Chung-Ching Lin
Kevin Qinghong Lin
W. Zuo
Lijuan Wang
ReLM, LRM
281
12
0
26 May 2025
Advancements in Medical Image Classification through Fine-Tuning Natural Domain Foundation Models
International Conference on Information Photonics (ICIP), 2025
Mobina Mansoori
Sajjad Shahabodini
Farnoush Bayatmakou
J. Abouei
Konstantinos N. Plataniotis
Arash Mohammadi
157
1
0
26 May 2025
PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology
Jiabo Ma
Yingxue Xu
Fengtao Zhou
Y. X. R. Wang
Cheng Jin
...
Xiuming Zhang
Li Liang
R. Chan
Zhe Wang
Huajun Chen
LM&MA, VLM
147
10
0
26 May 2025
Progressive Scaling Visual Object Tracking
Jack Hong
Shilin Yan
Zehao Xiao
Jiayin Cai
Xiaolong Jiang
Yao Hu
Henghui Ding
283
1
0
26 May 2025
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
Y. Chen
Wenjie Xiao
P. R. Bassi
Xinze Zhou
Sezgin Er
Ibrahim Ethem Hamamci
Zongwei Zhou
Yaoyao Liu
ELM
193
5
0
25 May 2025
Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation
Daniel Csizmadia
Andrei Codreanu
Victor Sim
Vighnesh Prabhu
Michael Lu
Kevin Zhu
Sean O'Brien
CLIP, VLM
415
4
0
25 May 2025
SynRES: Towards Referring Expression Segmentation in the Wild via Synthetic Data
Dong-Hee Kim
Hyunjee Song
Donghyun Kim
447
1
0
23 May 2025
Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts
Taewon Kang
Ming C. Lin
DiffM, VGen
341
1
0
22 May 2025
Panoptic Captioning: An Equivalence Bridge for Image and Text
Kun-Yu Lin
Hongjun Wang
Weining Ren
Kai Han
639
0
0
22 May 2025
Direct Preference Optimization for Adaptive Concept-based Explanations
Jacopo Teneggi
Zhenzhen Wang
Paul H. Yi
Tianmin Shu
Jeremias Sulam
477
0
0
21 May 2025
Enhancing LLMs for Time Series Forecasting via Structure-Guided Cross-Modal Alignment
Siming Sun
Kai Zhang
Xuejun Jiang
Wenchao Meng
Qinmin Yang
AI4TS
167
0
0
19 May 2025
TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning
Lihong Chen
Hossein Hassani
Soodeh Nikan
VLM
312
4
0
19 May 2025
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping
Subash Khanal
Srikumar Sastry
Aayush Dhakal
Adeel Ahmad
Nathan Jacobs
222
0
0
19 May 2025
Generalizable Vision-Language Few-Shot Adaptation with Predictive Prompts and Negative Learning
Sriram Mandalika
VLM
234
1
0
16 May 2025
GeoMM: On Geodesic Perspective for Multi-modal Learning
Computer Vision and Pattern Recognition (CVPR), 2025
Shibin Mei
Hang Wang
Bingbing Ni
289
0
0
16 May 2025
MAKE: Multi-Aspect Knowledge-Enhanced Vision-Language Pretraining for Zero-shot Dermatological Assessment
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2025
Siyuan Yan
Xiaochen Li
Ming Hu
Yiwen Jiang
Zhen Yu
Zongyuan Ge
MedIm, VLM
240
6
0
14 May 2025
Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization
Seongjae Kang
Dong Bok Lee
Hyungjoon Jang
Sung Ju Hwang
VLM
384
1
0
12 May 2025
Batch Augmentation with Unimodal Fine-tuning for Multimodal Learning
H. M. D. Kabir
S. Mondal
Mohammad Ali Moni
171
0
0
10 May 2025
ULFine: Unbiased Lightweight Fine-tuning for Foundation-Model-Assisted Long-Tailed Semi-Supervised Learning
Enhao Zhang
Chaohua Li
Chuanxing Geng
Songcan Chen
379
0
0
08 May 2025
X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP
Hanxun Huang
Sarah Monazam Erfani
Yige Li
Jiabo He
James Bailey
AAML
431
8
0
08 May 2025
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
Xianhang Li
Yixiao Liu
Haoqin Tu
Hongru Zhu
Cihang Xie
VLM
880
4
0
07 May 2025
HapticVLM: VLM-Driven Texture Recognition Aimed at Intelligent Haptic Interaction
Muhammad Haris Khan
Miguel Altamirano Cabrera
Dmitrii Iarchuk
Yara Mahmoud
Daria Trinitatova
Issatay Tokmurziyev
Dzmitry Tsetserukou
VLM
194
0
0
05 May 2025
Mitigating Group-Level Fairness Disparities in Federated Visual Language Models
Chaomeng Chen
Zitong Yu
Jin Song Dong
Sen Su
Linlin Shen
Shutao Xia
Simeng Qin
FedML, VLM
884
0
0
03 May 2025
Dual-Forecaster: A Multimodal Time Series Model Integrating Descriptive and Predictive Texts
Wenfa Wu
Guanyu Zhang
Zheng Tan
Yi Wang
Hongsheng Qi
AI4TS
213
2
0
02 May 2025
Scalability Matters: Overcoming Challenges in InstructGLM with Similarity-Degree-Based Sampling
Hyun Lee
Chris Yi
Maminur Islam
B.D.S. Aritra
208
0
0
02 May 2025
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
Computer Vision and Pattern Recognition (CVPR), 2025
Wufei Ma
Luoxin Ye
Nessa McWeeney
Celso M de Melo
Jieneng Chen
LRM
409
21
0
01 May 2025
Investigating Zero-Shot Diagnostic Pathology in Vision-Language Models with Efficient Prompt Design
Vasudev Sharma
Ahmed Alagha
Abdelhakim Khellaf
Vincent Quoc-Huy Trinh
Mahdi S. Hosseini
324
1
0
30 Apr 2025
Bayesian Principles Improve Prompt Learning In Vision-Language Models
International Conference on Artificial Intelligence and Statistics (AISTATS), 2025
Mingyu Kim
Jongwoo Ko
Mijung Park
VLM
338
1
0
19 Apr 2025
Chain-of-Evidence Multimodal Reasoning for Few-shot Temporal Action Localization
Hongwei Ji
Wulian Yun
Mengshi Qi
Huadong Ma
LRM
864
0
0
18 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjD, VOS
608
97
0
17 Apr 2025
Can Masked Autoencoders Also Listen to Birds?
Lukas Rauch
Ilyass Moummad
René Heinrich
Alexis Joly
Bernhard Sick
Christoph Scholz
481
8
0
17 Apr 2025
AdaptoVision: A Multi-Resolution Image Recognition Model for Robust and Scalable Classification
Md. Sanaullah Chowdhury
Lameya Sabrin
VLM
175
1
0
17 Apr 2025
Multimodal Representation Learning Techniques for Comprehensive Facial State Analysis
Kaiwen Zheng
Xuri Ge
Junchen Fu
Jun Peng
J. Jose
CVBM
187
0
0
14 Apr 2025
GFT: Gradient Focal Transformer
Boris Kriuk
Simranjit Kaur Gill
Shoaib Aslam
Amir Fakhrutdinov
176
0
0
14 Apr 2025
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
Weixian Lei
Jiacong Wang
Haochen Wang
Xuelong Li
Jun Hao Liew
Jiashi Feng
Zilong Huang
215
19
0
14 Apr 2025
3D CoCa: Contrastive Learners are 3D Captioners
Ting Huang
Zhenru Zhang
Longji Xu
Hao Tang
252
6
0
13 Apr 2025
Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
Tommaso Galliena
Tommaso Apicella
Stefano Rosa
Pietro Morerio
Alessio Del Bue
Lorenzo Natale
325
0
0
11 Apr 2025
Kimi-VL Technical Report
Kimi Team
Angang Du
B. Yin
Bowei Xing
Bowen Qu
...
Longxiang Zhang
Zhe Chen
Zijia Zhao
Ziwei Chen
Zongyu Lin
MLLM, VLM, MoE
908
136
0
10 Apr 2025
A Survey of Pathology Foundation Model: Progress and Future Directions
International Joint Conference on Artificial Intelligence (IJCAI), 2024
Conghao Xiong
Hao Chen
Joseph J. Y. Sung
LM&MA, AI4CE
390
6
0
05 Apr 2025
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Computer Vision and Pattern Recognition (CVPR), 2025
Dahun Kim
A. Piergiovanni
Ganesh Mallya
A. Angelova
CoGe
357
3
0
04 Apr 2025
Safety Modulation: Enhancing Safety in Reinforcement Learning through Cost-Modulated Rewards
Hanping Zhang
Yuhong Guo
OffRL
268
2
0
03 Apr 2025