Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2205.01917
Cited By
CoCa: Contrastive Captioners are Image-Text Foundation Models
4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
Re-assign community
ArXiv
PDF
HTML
Papers citing
"CoCa: Contrastive Captioners are Image-Text Foundation Models"
50 / 910 papers shown
Title
Data-Efficient Generalization for Zero-shot Composed Image Retrieval
Zining Chen
Zhicheng Zhao
Fei Su
Xiaoqin Zhang
Shijian Lu
VLM
40
0
0
07 Mar 2025
Visual Cues of Gender and Race are Associated with Stereotyping in Vision-Language Models
Messi H.J. Lee
Soyeon Jeon
Jacob M. Montgomery
Calvin K Lai
VLM
CoGe
69
0
0
07 Mar 2025
CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP
Songlong Xing
Zhengyu Zhao
N. Sebe
AAML
57
0
0
05 Mar 2025
Language-Assisted Feature Transformation for Anomaly Detection
EungGu Yun
Heonjin Ha
Yeongwoo Nam
Bryan Dongik Lee
61
0
0
03 Mar 2025
Enhancing Monocular 3D Scene Completion with Diffusion Model
Changlin Song
Jiaqi Wang
Liyun Zhu
He Weng
3DGS
33
0
0
02 Mar 2025
Pathology Report Generation and Multimodal Representation Learning for Cutaneous Melanocytic Lesions
R. Lucassen
Sander P.J. Moonemans
Tijn van de Luijtgaarden
Gerben E. Breimer
W. Blokx
M. Veta
MedIm
60
1
0
26 Feb 2025
Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP
Chenyang Zhao
Kun Wang
J. H. Hsiao
Antoni B. Chan
CLIP
66
0
0
26 Feb 2025
TransVDM: Motion-Constrained Video Diffusion Model for Transparent Video Synthesis
Menghao Li
Zhenghao Zhang
Junchao Liao
Long Qin
Weizhi Wang
DiffM
VGen
62
0
0
26 Feb 2025
On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation
R. Lucassen
Tijn van de Luijtgaarden
Sander P.J. Moonemans
Gerben E. Breimer
W. Blokx
M. Veta
63
0
0
26 Feb 2025
CLIPure: Purification in Latent Space via CLIP for Adversarially Robust Zero-Shot Classification
Mingkun Zhang
Keping Bi
Wei Chen
J. Guo
Xueqi Cheng
BDL
VLM
47
1
0
25 Feb 2025
DUNIA: Pixel-Sized Embeddings via Cross-Modal Alignment for Earth Observation Applications
Ibrahim Fayad
Max Zimmer
Martin Schwartz
P. Ciais
Fabian Gieseke
Gabriel Belouze
Sarah Brood
A. D. Truchis
Alexandre d’Aspremont
AI4TS
38
0
0
24 Feb 2025
Infrared Image Super-Resolution: Systematic Review, and Future Trends
Y. Huang
Tomo Miyazaki
Xiao-Fang Liu
S. Omachi
SupR
74
10
0
21 Feb 2025
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval
Guanqi Zhan
Yuanpei Liu
Kai Han
Weidi Xie
Andrew Zisserman
VLM
66
0
0
21 Feb 2025
MindLLM: A Subject-Agnostic and Versatile Model for fMRI-to-Text Decoding
Weikang Qiu
Zheng Huang
Haoyu Hu
Aosong Feng
Yujun Yan
Rex Ying
43
0
0
18 Feb 2025
A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards
Shivansh Patel
Xinchen Yin
Wenlong Huang
Shubham Garg
H. Nayyeri
Li Fei-Fei
Svetlana Lazebnik
Y. Li
87
0
0
12 Feb 2025
HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads
Guobing Gan
Kaiming Gao
Li Wang
Shen Jiang
Peng Jiang
59
0
0
09 Feb 2025
Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
Feng Wang
Yaodong Yu
Guoyizhe Wei
Wei Shao
Yuyin Zhou
Alan Yuille
Cihang Xie
ViT
88
4
0
06 Feb 2025
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
Marco Mistretta
Alberto Baldrati
Lorenzo Agnolucci
Marco Bertini
Andrew D. Bagdanov
CLIP
VLM
99
2
0
06 Feb 2025
Disentangling CLIP Features for Enhanced Localized Understanding
Samyak Rawelekar
Yujun Cai
Yiwei Wang
Ming-Hsuan Yang
N. Ahuja
VLM
CoGe
65
0
0
05 Feb 2025
Vision-Language Model Selection and Reuse for Downstream Adaptation
Hao-Zhe Tan
Zhi-Hua Zhou
Lan-Zhe Guo
Yu-Feng Li
VLM
88
0
0
30 Jan 2025
Audio-Language Models for Audio-Centric Tasks: A survey
Yi Su
Jisheng Bai
Qisheng Xu
Kele Xu
Yong Dou
AuLLM
99
1
0
28 Jan 2025
Diffusion Generative Modeling for Spatially Resolved Gene Expression Inference from Histology Images
Sichen Zhu
Yuchen Zhu
Molei Tao
Peng-Chao Qiu
MedIm
29
0
0
28 Jan 2025
Generating customized prompts for Zero-Shot Rare Event Medical Image Classification using LLM
Payal Kamboj
Ayan Banerjee
Bin Xu
Sandeep K. S. Gupta
VLM
MedIm
31
0
0
27 Jan 2025
With Great Backbones Comes Great Adversarial Transferability
Erik Arakelyan
Karen Hambardzumyan
Davit Papikyan
Pasquale Minervini
Albert Gordo
Isabelle Augenstein
Aram H. Markosyan
AAML
60
0
0
21 Jan 2025
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature
Alejandro Lozano
M. W. Sun
James Burgess
Liangyu Chen
Jeffrey Nirschl
...
Xiaohan Wang
Yuhui Zhang
Alfred Seunghoon Song
Robert Tibshirani
Serena Yeung-Levy
LM&MA
VLM
MedIm
58
5
0
13 Jan 2025
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
Ziyan Jiang
Rui Meng
Xinyi Yang
Semih Yavuz
Yingbo Zhou
Wenhu Chen
MLLM
VLM
51
18
0
03 Jan 2025
Tuning Vision-Language Models with Candidate Labels by Prompt Alignment
Zhifang Zhang
Yuwei Niu
Xin Liu
Beibei Li
VPVLM
VLM
54
0
0
31 Dec 2024
Improving Generated and Retrieved Knowledge Combination Through Zero-shot Generation
Xinkai Du
Quanjie Han
Chao Lv
Y. Liu
Yalin Sun
Hao Shu
Hongbo Shan
Maosong Sun
RALM
30
1
0
25 Dec 2024
Cross-Modal Few-Shot Learning with Second-Order Neural Ordinary Differential Equations
Yi Zhang
Chun-Wun Cheng
Junyi He
Zhihai He
Carola-Bibiane Schonlieb
Yuyan Chen
Angelica I Aviles-Rivero
AI4TS
75
0
0
20 Dec 2024
Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization
Yue Zhang
Liqiang Jing
Vibhav Gogate
116
2
0
19 Dec 2024
Bringing Multimodality to Amazon Visual Search System
Xinliang Zhu
Michael Huang
Han Ding
Jinyu Yang
Kelvin Chen
...
Son Dinh Tran
Benjamin Z. Yao
Doug Gray
Anuj Bindal
Arnab Dhua
69
3
0
17 Dec 2024
LLMs are Also Effective Embedding Models: An In-depth Overview
Chongyang Tao
Tao Shen
Shen Gao
Junshuo Zhang
Zhen Li
Zhengwei Tao
Shuai Ma
68
7
0
17 Dec 2024
CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image
Wonseok Roh
Hwanhee Jung
Jong Wook Kim
S. Lee
Innfarn Yoo
Andreas Lugmayr
Seunggeun Chi
K. Ramani
Sangpil Kim
3DGS
75
2
0
17 Dec 2024
SAMIC: Segment Anything with In-Context Spatial Prompt Engineering
S. Nagendra
Kashif Rashid
Chaopeng Shen
Daniel Kifer
VLM
69
2
0
16 Dec 2024
UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval
Haoyu Jiang
Zhi-Qi Cheng
Gabriel Moreira
Jiawen Zhu
Jingdong Sun
Bukun Ren
Jun-Yan He
Qi Dai
Xian-Sheng Hua
VLM
78
0
0
14 Dec 2024
DiffCLIP: Few-shot Language-driven Multimodal Classifier
Jiaqing Zhang
Mingxiang Cao
Xue Yang
Kai Jiang
Yunsong Li
VLM
66
0
0
10 Dec 2024
LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations
Mingjie Xu
Mengyang Wu
Yuzhi Zhao
Jason Chun Lok Li
Weifeng Ou
LRM
SyDa
VLM
57
2
0
09 Dec 2024
Unified Framework for Open-World Compositional Zero-shot Learning
Hirunima Jayasekara
Khoi Pham
Nirat Saini
Abhinav Shrivastava
56
0
0
05 Dec 2024
FLAIR: VLM with Fine-grained Language-informed Image Representations
Rui Xiao
Sanghwan Kim
Mariana-Iuliana Georgescu
Zeynep Akata
Stephan Alaniz
VLM
CLIP
69
2
0
04 Dec 2024
COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
Sanghwan Kim
Rui Xiao
Mariana-Iuliana Georgescu
Stephan Alaniz
Zeynep Akata
VLM
70
0
0
02 Dec 2024
CAREL: Instruction-guided reinforcement learning with cross-modal auxiliary objectives
Armin Saghafian
Amirmohammad Izadi
Negin Hashemi Dijujin
M. Baghshah
64
0
0
29 Nov 2024
Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads
Siqi Kou
Jiachun Jin
Chang Liu
Ye Ma
Jian Jia
Quan Chen
Peng Jiang
Zhijie Deng
Zhijie Deng
DiffM
VGen
VLM
109
5
0
28 Nov 2024
VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis
Donggoo Kang
Dasol Jeong
Hyunmin Lee
Sangwoo Park
Hasil Park
Sunkyu Kwon
Yeongjoon Kim
Joonki Paik
MLLM
VLM
69
0
0
27 Nov 2024
Evaluating Vision-Language Models as Evaluators in Path Planning
Mohamed Aghzal
Xiang Yue
E. Plaku
Ziyu Yao
LRM
72
1
0
27 Nov 2024
ResCLIP: Residual Attention for Training-free Dense Vision-language Inference
Yuhang Yang
Jinhong Deng
Wen Li
Lixin Duan
VLM
71
0
0
24 Nov 2024
Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation
Sule Bai
Yong-Jin Liu
Yifei Han
Haoji Zhang
Yansong Tang
VLM
72
3
0
24 Nov 2024
Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment
Alvi Md Ishmam
Christopher Thomas
AAML
107
3
0
23 Nov 2024
OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
Ming Hu
Kun Yuan
Yaling Shen
Feilong Tang
Xiaohao Xu
...
Jin Ye
N. Padoy
Nassir Navab
Junjun He
Zongyuan Ge
VLM
CLIP
85
10
0
23 Nov 2024
Towards a Comprehensive Benchmark for Pathological Lymph Node Metastasis in Breast Cancer Sections
Xitong Ling
Yuanyuan Lei
Jiawen Li
Junru Cheng
Wenting Huang
Tian Guan
Jian Guan
Yonghong He
20
4
0
16 Nov 2024
Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation
Yuheng Shi
Minjing Dong
Chang Xu
VLM
27
1
0
14 Nov 2024
Previous
1
2
3
4
5
...
17
18
19
Next