CoCa: Contrastive Captioners are Image-Text Foundation Models
v1, v2 (latest)

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM, CLIP, OffRL

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 1,042 papers shown
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
Jie Ma
Zhitao Gao
Qi Chai
Jing Liu
Peijie Wang
Jing Tao
Zhou Su
387
5
0
01 Apr 2025
GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology
S. Kapse
Pushpak Pati
Srikar Yellapragada
Srijan Das
Rajarsi R. Gupta
Joel H. Saltz
Dimitris Samaras
Prateek Prasanna
VLM
267
4
0
01 Apr 2025
IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval
Bangwei Liu
Yicheng Bao
Shaohui Lin
Xuhong Wang
Xin Tan
Longji Xu
Yuan Xie
Chaochao Lu
341
3
0
01 Apr 2025
Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation
Computer Vision and Pattern Recognition (CVPR), 2025
Ting Liu
Siyuan Li
249
10
0
01 Apr 2025
Order Matters: On Parameter-Efficient Image-to-Video Probing for Recognizing Nearly Symmetric Actions
Thinesh Thiyakesan Ponbagavathi
Alina Roitberg
218
3
0
31 Mar 2025
Self-Evolving Visual Concept Library using Vision-Language Critics
Computer Vision and Pattern Recognition (CVPR), 2025
Atharva Sehgal
Patrick Yuan
Ziniu Hu
Yisong Yue
Jennifer J. Sun
Swarat Chaudhuri
VLM
216
2
0
31 Mar 2025
A Survey on Remote Sensing Foundation Models: From Vision to Multimodality
Ziyue Huang
Hongxi Yan
Qiqi Zhan
Shuai Yang
Mingming Zhang
Yiming Lei
Chenkai Zhang
Zeming Liu
Qingjie Liu
Longji Xu
372
10
0
28 Mar 2025
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Yuxiao Chen
L. Meng
Wujian Peng
Zuxuan Wu
Yu-Gang Jiang
VLM
454
4
0
24 Mar 2025
Towards Training-free Anomaly Detection with Vision and Language Foundation Models
Computer Vision and Pattern Recognition (CVPR), 2025
Jinjin Zhang
Guodong Wang
Yizhou Jin
Di Huang
228
10
0
24 Mar 2025
Compositional Caching for Training-free Open-vocabulary Attribute Detection
Computer Vision and Pattern Recognition (CVPR), 2025
Marco Garosi
Alessandro Conti
Gaowen Liu
Elisa Ricci
Goran Frehse
ObjD, VLM
331
1
0
24 Mar 2025
good4cir: Generating Detailed Synthetic Captions for Composed Image Retrieval
Pranavi Kolouju
Eric Xing
Robert Pless
Nathan Jacobs
Abby Stylianou
3DV
194
5
0
22 Mar 2025
ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology
Vishwesh Ramanathan
Tony Xu
Pushpak Pati
Faruk Ahmed
Maged Goubran
Anne L. Martel
250
0
0
21 Mar 2025
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
Computer Vision and Pattern Recognition (CVPR), 2025
Gensheng Pei
Tao Chen
Yujia Wang
Xinhao Cai
Xiangbo Shu
Tianfei Zhou
Yazhou Yao
VLM
297
5
0
21 Mar 2025
Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology
Siyuan Yan
Ming Hu
Yiwen Jiang
Xiaochen Li
Hao Fei
P. Tschandl
Harald Kittler
Zongyuan Ge
VLM
405
11
0
19 Mar 2025
Optimized 3D Gaussian Splatting using Coarse-to-Fine Image Frequency Modulation
Umar Farooq
Jean-Yves Guillemaut
Adrian Hilton
M. Volino
3DGS
412
3
0
18 Mar 2025
Squeeze Out Tokens from Sample for Finer-Grained Data Governance
Weixiong Lin
Chen Ju
Haicheng Wang
Shengchao Hu
Shuai Xiao
...
Yuheng Jiao
Mingshuai Yao
Jinsong Lan
Qingwen Liu
Ying Chen
280
3
0
18 Mar 2025
Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data
Haozhe Si
Yuxuan Wan
Minh Do
Deepak Vasisht
Han Zhao
Hendrik Hamann
492
2
0
17 Mar 2025
Quantum EigenGame for excited state calculation
David Quiroga
Jason Han
Anastasios Kyrillidis
280
4
0
17 Mar 2025
Dynamic Relation Inference via Verb Embeddings
Omri Suissa
Muhiim Ali
Ariana Azarbal
Hui Shen
Shekhar Pradhan
383
0
0
17 Mar 2025
Safe Vision-Language Models via Unsafe Weights Manipulation
Moreno D'Incà
E. Peruzzo
Xingqian Xu
Humphrey Shi
Andrii Zadaianchuk
Goran Frehse
MU
304
1
0
14 Mar 2025
Towards Understanding Graphical Perception in Large Multimodal Models
Kai Zhang
Jianwei Yang
J. Inala
Chandan Singh
Jianfeng Gao
Eric Fosler-Lussier
Chenglong Wang
295
2
0
13 Mar 2025
ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning
The Web Conference (WWW), 2025
Pengfei Luo
Jingbo Zhou
Tong Xu
Yuan Xia
Linli Xu
Tong Xu
LRM
350
6
0
13 Mar 2025
Leveraging Vision-Language Embeddings for Zero-Shot Learning in Histopathology Images
IEEE Journal of Biomedical and Health Informatics (JBHI), 2025
M. Rahaman
Ewan K. A. Millar
Erik H. W. Meijering
VLM
305
2
0
13 Mar 2025
Multi-Modal Foundation Models for Computational Pathology: A Survey
Dong Li
Guihong Wan
Xintao Wu
Xinyu Wu
Xiaohui Chen
Yi He
Christine G. Lian
Peter K. Sorger
Yevgeniy R. Semenov
Chen Zhao
MedIm
424
6
0
12 Mar 2025
ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation
Tobias Christian Nauen
Brian B. Moser
Federico Raue
Stanislav Frolov
Andreas Dengel
ViT
453
0
0
12 Mar 2025
Is CLIP ideal? No. Can we fix it? Yes!
Raphi Kang
Yue Song
Georgia Gkioxari
Pietro Perona
VLM
372
4
0
10 Mar 2025
Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models
Md Azim Khan
A. Gangopadhyay
Jianwu Wang
Robert F. Erbacher
VLM
189
0
0
08 Mar 2025
Data-Efficient Generalization for Zero-shot Composed Image Retrieval
Zining Chen
Zhicheng Zhao
Fei Su
Xiaoqin Zhang
Shijian Lu
VLM
392
3
0
07 Mar 2025
Visual Cues of Gender and Race are Associated with Stereotyping in Vision-Language Models
Messi H.J. Lee
Soyeon Jeon
Jacob M. Montgomery
Calvin K. Lai
VLM, CoGe
249
1
0
07 Mar 2025
CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP
Computer Vision and Pattern Recognition (CVPR), 2025
Songlong Xing
Zhengyu Zhao
Andrii Zadaianchuk
AAML
554
10
0
05 Mar 2025
Language-Assisted Feature Transformation for Anomaly Detection
International Conference on Learning Representations (ICLR), 2025
EungGu Yun
Heonjin Ha
Yeongwoo Nam
Bryan Dongik Lee
436
2
0
03 Mar 2025
Enhancing Monocular 3D Scene Completion with Diffusion Model
Changlin Song
Jiaqi Wang
Liyun Zhu
He Weng
3DGS
193
0
0
02 Mar 2025
MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention
IEEE Transactions on Medical Imaging (IEEE TMI), 2025
Tianyi Wang
Jianan Fan
Dingxin Zhang
Dongnan Liu
Yong-quan Xia
Heng Huang
Weidong Cai
576
3
0
01 Mar 2025
On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation
R. Lucassen
Tijn van de Luijtgaarden
Sander P.J. Moonemans
Gerben E. Breimer
W. Blokx
M. Veta
374
0
0
26 Feb 2025
Pathology Report Generation and Multimodal Representation Learning for Cutaneous Melanocytic Lesions
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2025
R. Lucassen
Sander P.J. Moonemans
Tijn van de Luijtgaarden
Gerben E. Breimer
W. Blokx
M. Veta
MedIm
294
3
0
26 Feb 2025
TransVDM: Motion-Constrained Video Diffusion Model for Transparent Video Synthesis
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Menghao Li
Zhenghao Zhang
Junchao Liao
Long Qin
Weizhi Wang
DiffM, VGen
251
1
0
26 Feb 2025
Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP
Chenyang Zhao
Kun Wang
J. H. Hsiao
Antoni B. Chan
CLIP
263
7
0
26 Feb 2025
CLIPure: Purification in Latent Space via CLIP for Adversarially Robust Zero-Shot Classification
International Conference on Learning Representations (ICLR), 2025
Mingkun Zhang
Keping Bi
Wei Chen
Jiafeng Guo
Xueqi Cheng
BDL, VLM
469
8
0
25 Feb 2025
DUNIA: Pixel-Sized Embeddings via Cross-Modal Alignment for Earth Observation Applications
Ibrahim Fayad
Max Zimmer
Martin Schwartz
P. Ciais
Fabian Gieseke
Gabriel Belouze
Sarah Brood
A. D. Truchis
Alexandre d’Aspremont
AI4TS
352
0
0
24 Feb 2025
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval
Guanqi Zhan
Yuanpei Liu
Kai Han
Weidi Xie
Andrew Zisserman
VLM
1.1K
0
0
21 Feb 2025
MindLLM: A Subject-Agnostic and Versatile Model for fMRI-to-Text Decoding
Weikang Qiu
Zheng Huang
Haoyu Hu
Aosong Feng
Yujun Yan
Rex Ying
397
10
0
18 Feb 2025
Insect-Foundation: A Foundation Model and Large Multimodal Dataset for Vision-Language Insect Understanding
International Journal of Computer Vision (IJCV), 2025
Thanh-Dat Truong
Hoang-Quan Nguyen
Xuan-Bac Nguyen
Ashley Dowling
Pawan Sinha
Khoa Luu
MLLM
210
5
0
17 Feb 2025
A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards
IEEE International Conference on Robotics and Automation (ICRA), 2025
Shivansh Patel
Xinchen Yin
Wenlong Huang
Shubham Garg
H. Nayyeri
Li Fei-Fei
Svetlana Lazebnik
Yongqian Li
408
1
0
12 Feb 2025
HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads
The Web Conference (WWW), 2025
Guobing Gan
Kaiming Gao
Li Wang
Shen Jiang
Peng Jiang
200
1
0
09 Feb 2025
Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
Feng Wang
Yaodong Yu
Guoyizhe Wei
Wei Shao
Yuyin Zhou
Alan Yuille
Cihang Xie
ViT
384
17
0
06 Feb 2025
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
International Conference on Learning Representations (ICLR), 2025
Marco Mistretta
Alberto Baldrati
Lorenzo Agnolucci
Marco Bertini
Andrew D. Bagdanov
CLIP, VLM
443
14
0
06 Feb 2025
Vision-Language Model Selection and Reuse for Downstream Adaptation
Hao-Zhe Tan
Zhi Zhou
Lan-Zhe Guo
Yu-Feng Li
VLM
356
0
0
30 Jan 2025
Audio-Language Models for Audio-Centric Tasks: A survey
Yi Su
Jisheng Bai
Qisheng Xu
Kele Xu
Yong Dou
AuLLM
336
14
0
28 Jan 2025
Diffusion Generative Modeling for Spatially Resolved Gene Expression Inference from Histology Images
International Conference on Learning Representations (ICLR), 2025
Sichen Zhu
Yuchen Zhu
Molei Tao
Peng-Chao Qiu
MedIm
209
21
0
28 Jan 2025
Generating customized prompts for Zero-Shot Rare Event Medical Image Classification using LLM
IEEE International Symposium on Biomedical Imaging (ISBI), 2025
Payal Kamboj
Ayan Banerjee
Bin Xu
Sandeep K. S. Gupta
VLM, MedIm
159
0
0
27 Jan 2025