
CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022

Mojtaba Seyedhosseini


Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 1,042 papers shown

Title
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning Jie Ma Zhitao Gao Qi Chai Jing Liu Peijie Wang Jing Tao Zhou Su 387 5 0 01 Apr 2025
GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology S. Kapse Pushpak Pati Srikar Yellapragada Srijan Das Rajarsi R. Gupta Joel H. Saltz Dimitris Samaras Prateek Prasanna VLM 267 4 0 01 Apr 2025
IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval Bangwei Liu Yicheng Bao Shaohui Lin Xuhong Wang Xin Tan Longji Xu Yuan Xie Chaochao Lu 341 3 0 01 Apr 2025
Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation Computer Vision and Pattern Recognition (CVPR), 2025 Ting Liu Siyuan Li 249 10 0 01 Apr 2025
Order Matters: On Parameter-Efficient Image-to-Video Probing for Recognizing Nearly Symmetric Actions Thinesh Thiyakesan Ponbagavathi Alina Roitberg 218 3 0 31 Mar 2025
Self-Evolving Visual Concept Library using Vision-Language Critics Computer Vision and Pattern Recognition (CVPR), 2025 Atharva Sehgal Patrick Yuan Ziniu Hu Yisong Yue Jennifer J. Sun Swarat Chaudhuri VLM 216 2 0 31 Mar 2025
A Survey on Remote Sensing Foundation Models: From Vision to Multimodality Ziyue Huang Hongxi Yan Qiqi Zhan Shuai Yang Mingming Zhang Yiming Lei Chenkai Zhang Zeming Liu Qingjie Liu Longji Xu 372 10 0 28 Mar 2025
CoMP: Continual Multimodal Pre-training for Vision Foundation Models Yuxiao Chen L. Meng Wujian Peng Zuxuan Wu Yu-Gang Jiang VLM 454 4 0 24 Mar 2025
Towards Training-free Anomaly Detection with Vision and Language Foundation Models Computer Vision and Pattern Recognition (CVPR), 2025 Jinjin Zhang Guodong Wang Yizhou Jin Di Huang 228 10 0 24 Mar 2025
Compositional Caching for Training-free Open-vocabulary Attribute Detection Computer Vision and Pattern Recognition (CVPR), 2025 Marco Garosi Alessandro Conti Gaowen Liu Elisa Ricci Goran Frehse ObjD VLM 331 1 0 24 Mar 2025
good4cir: Generating Detailed Synthetic Captions for Composed Image Retrieval Pranavi Kolouju Eric Xing Robert Pless Nathan Jacobs Abby Stylianou 3DV 194 5 0 22 Mar 2025
ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology Vishwesh Ramanathan Tony Xu Pushpak Pati Faruk Ahmed Maged Goubran Anne L. Martel 250 0 0 21 Mar 2025
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection Computer Vision and Pattern Recognition (CVPR), 2025 Gensheng Pei Tao Chen Yujia Wang Xinhao Cai Xiangbo Shu Tianfei Zhou Yazhou Yao VLM 297 5 0 21 Mar 2025
Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology Siyuan Yan Ming Hu Yiwen Jiang Xiaochen Li Hao Fei P. Tschandl Harald Kittler Zongyuan Ge VLM 405 11 0 19 Mar 2025
Optimized 3D Gaussian Splatting using Coarse-to-Fine Image Frequency Modulation Umar Farooq Jean-Yves Guillemaut Adrian Hilton M. Volino 3DGS 412 3 0 18 Mar 2025
Squeeze Out Tokens from Sample for Finer-Grained Data Governance Weixiong Lin Chen Ju Haicheng Wang Shengchao Hu Shuai Xiao ... Yuheng Jiao Mingshuai Yao Jinsong Lan Qingwen Liu Ying Chen 280 3 0 18 Mar 2025
Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data Haozhe Si Yuxuan Wan Minh Do Deepak Vasisht Han Zhao Hendrik Hamann 492 2 0 17 Mar 2025
Quantum EigenGame for excited state calculation David Quiroga Jason Han Anastasios Kyrillidis 280 4 0 17 Mar 2025
Dynamic Relation Inference via Verb Embeddings Omri Suissa Muhiim Ali Ariana Azarbal Hui Shen Shekhar Pradhan 383 0 0 17 Mar 2025
Safe Vision-Language Models via Unsafe Weights Manipulation Moreno D'Incà E. Peruzzo Xingqian Xu Humphrey Shi Andrii Zadaianchuk Goran Frehse MU 304 1 0 14 Mar 2025
Towards Understanding Graphical Perception in Large Multimodal Models Kai Zhang Jianwei Yang J. Inala Chandan Singh Jianfeng Gao Eric Fosler-Lussier Chenglong Wang 295 2 0 13 Mar 2025
ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning The Web Conference (WWW), 2025 Pengfei Luo Jingbo Zhou Tong Xu Yuan Xia Linli Xu Tong Xu LRM 350 6 0 13 Mar 2025
Leveraging Vision-Language Embeddings for Zero-Shot Learning in Histopathology Images IEEE journal of biomedical and health informatics (JBHI), 2025 M. Rahaman Ewan K. A. Millar Erik H. W. Meijering VLM 305 2 0 13 Mar 2025
Multi-Modal Foundation Models for Computational Pathology: A Survey Dong Li Guihong Wan Xintao Wu Xinyu Wu Xiaohui Chen Yi He Christine G. Lian Peter K. Sorger Yevgeniy R. Semenov Chen Zhao MedIm 424 6 0 12 Mar 2025
ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation Tobias Christian Nauen Brian B. Moser Federico Raue Stanislav Frolov Andreas Dengel ViT 453 0 0 12 Mar 2025
Is CLIP ideal? No. Can we fix it? Yes! Raphi Kang Yue Song Georgia Gkioxari Pietro Perona VLM 372 4 0 10 Mar 2025
Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models Md Azim Khan A. Gangopadhyay Jianwu Wang Robert F. Erbacher VLM 189 0 0 08 Mar 2025
Data-Efficient Generalization for Zero-shot Composed Image Retrieval Zining Chen Zhicheng Zhao Fei Su Xiaoqin Zhang Shijian Lu VLM 392 3 0 07 Mar 2025
Visual Cues of Gender and Race are Associated with Stereotyping in Vision-Language Models Messi H.J. Lee Soyeon Jeon Jacob M. Montgomery Calvin K. Lai VLM CoGe 249 1 0 07 Mar 2025
CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP Computer Vision and Pattern Recognition (CVPR), 2025 Songlong Xing Zhengyu Zhao Andrii Zadaianchuk AAML 554 10 0 05 Mar 2025
Language-Assisted Feature Transformation for Anomaly Detection International Conference on Learning Representations (ICLR), 2025 EungGu Yun Heonjin Ha Yeongwoo Nam Bryan Dongik Lee 436 2 0 03 Mar 2025
Enhancing Monocular 3D Scene Completion with Diffusion Model Changlin Song Jiaqi Wang Liyun Zhu He Weng 3DGS 193 0 0 02 Mar 2025
MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention IEEE Transactions on Medical Imaging (IEEE TMI), 2025 Tianyi Wang Jianan Fan Dingxin Zhang Dongnan Liu Yong-quan Xia Heng Huang Weidong Cai 576 3 0 01 Mar 2025
On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation R. Lucassen Tijn van de Luijtgaarden Sander P.J. Moonemans Gerben E. Breimer W. Blokx M. Veta 374 0 0 26 Feb 2025
Pathology Report Generation and Multimodal Representation Learning for Cutaneous Melanocytic Lesions International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2025 R. Lucassen Sander P.J. Moonemans Tijn van de Luijtgaarden Gerben E. Breimer W. Blokx M. Veta MedIm 294 3 0 26 Feb 2025
TransVDM: Motion-Constrained Video Diffusion Model for Transparent Video Synthesis IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025 Menghao Li Zhenghao Zhang Junchao Liao Long Qin Weizhi Wang DiffM VGen 251 1 0 26 Feb 2025
Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP Chenyang Zhao Kun Wang J. H. Hsiao Antoni B. Chan CLIP 263 7 0 26 Feb 2025
CLIPure: Purification in Latent Space via CLIP for Adversarially Robust Zero-Shot Classification International Conference on Learning Representations (ICLR), 2025 Mingkun Zhang Keping Bi Wei Chen Jiafeng Guo Xueqi Cheng BDL VLM 469 8 0 25 Feb 2025
DUNIA: Pixel-Sized Embeddings via Cross-Modal Alignment for Earth Observation Applications Ibrahim Fayad Max Zimmer Martin Schwartz P. Ciais Fabian Gieseke Gabriel Belouze Sarah Brood A. D. Truchis Alexandre d’Aspremont AI4TS 352 0 0 24 Feb 2025
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval Guanqi Zhan Yuanpei Liu Kai Han Weidi Xie Andrew Zisserman VLM 1.1K 0 0 21 Feb 2025
MindLLM: A Subject-Agnostic and Versatile Model for fMRI-to-Text Decoding Weikang Qiu Zheng Huang Haoyu Hu Aosong Feng Yujun Yan Rex Ying 397 10 0 18 Feb 2025
Insect-Foundation: A Foundation Model and Large Multimodal Dataset for Vision-Language Insect Understanding International Journal of Computer Vision (IJCV), 2025 Thanh-Dat Truong Hoang-Quan Nguyen Xuan-Bac Nguyen Ashley Dowling Pawan Sinha Khoa Luu MLLM 210 5 0 17 Feb 2025
A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards IEEE International Conference on Robotics and Automation (ICRA), 2025 Shivansh Patel Xinchen Yin Wenlong Huang Shubham Garg H. Nayyeri Li Fei-Fei Svetlana Lazebnik Yongqian Li 408 1 0 12 Feb 2025
HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads The Web Conference (WWW), 2025 Guobing Gan Kaiming Gao Li Wang Shen Jiang Peng Jiang 200 1 0 09 Feb 2025
Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More Feng Wang Yaodong Yu Guoyizhe Wei Wei Shao Yuyin Zhou Alan Yuille Cihang Xie ViT 384 17 0 06 Feb 2025
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion International Conference on Learning Representations (ICLR), 2025 Marco Mistretta Alberto Baldrati Lorenzo Agnolucci Marco Bertini Andrew D. Bagdanov CLIP VLM 443 14 0 06 Feb 2025
Vision-Language Model Selection and Reuse for Downstream Adaptation Hao-Zhe Tan Zhi Zhou Lan-Zhe Guo Yu-Feng Li VLM 356 0 0 30 Jan 2025
Audio-Language Models for Audio-Centric Tasks: A survey Yi Su Jisheng Bai Qisheng Xu Kele Xu Yong Dou AuLLM 336 14 0 28 Jan 2025
Diffusion Generative Modeling for Spatially Resolved Gene Expression Inference from Histology Images International Conference on Learning Representations (ICLR), 2025 Sichen Zhu Yuchen Zhu Molei Tao Peng-Chao Qiu MedIm 209 21 0 28 Jan 2025
Generating customized prompts for Zero-Shot Rare Event Medical Image Classification using LLM IEEE International Symposium on Biomedical Imaging (ISBI), 2025 Payal Kamboj Ayan Banerjee Bin Xu Sandeep K. S. Gupta VLM MedIm 159 0 0 27 Jan 2025