CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022

Mojtaba Seyedhosseini

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 1,043 papers shown

Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment

185

28 May 2024

Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model

Hongxia Yang

131

28 May 2024

Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling

Cristian Rodriguez-Opazo

Ehsan Abbasnejad

Damien Teney

Edison Marrese-Taylor

Hamed Damirchi

Anton Van Den Hengel

VLM

348

27 May 2024

A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts

Mohammed Nowaz Rabbani Chowdhury

Christopher Carothers

MoE

409

26 May 2024

ECG Semantic Integrator (ESI): A Foundation ECG Model Pretrained with LLM-Enhanced Cardiological Text

Han Yu

Peikun Guo

Akane Sano

212

26 May 2024

DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution

212

25 May 2024

A Survey on Vision-Language-Action Models for Embodied AI

910

169

23 May 2024

No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models

Ibrahim Alabdulmohsin

VLM

273

22 May 2024

More Distinctively Black and Feminine Faces Lead to Increased Stereotyping in Vision-Language Models

196

22 May 2024

OpenCarbonEval: A Unified Carbon Emission Estimation Framework in Large-Scale AI Models

217

21 May 2024

Transcriptomics-guided Slide Representation Learning in Computational Pathology

Computer Vision and Pattern Recognition (CVPR), 2024

Drew F. K. Williamson

Thomas Peeters

Andrew H. Song

Faisal Mahmood

299

19 May 2024

A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision

326

16 May 2024

PRISM: A Multi-Modal Generative Foundation Model for Slide-Level Histopathology

...

311

16 May 2024

CLIP with Quality Captions: A Strong Pretraining for Vision Tasks

Pavan Kumar Anasosalu Vasu

266

14 May 2024

Efficient Vision-Language Pre-training by Cluster Masking

Computer Vision and Pattern Recognition (CVPR), 2024

312

14 May 2024

All in One Framework for Multimodal Re-identification in the Wild

He Li

Mang Ye

Ming Zhang

Bo Du

291

08 May 2024

iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

Lorenzo Agnolucci

Alberto Baldrati

Marco Bertini

377

05 May 2024

Understanding Retrieval-Augmented Task Adaptation for Vision-Language Models

International Conference on Machine Learning (ICML), 2024

Yifei Ming

Yixuan Li

VLM

294

02 May 2024

Hallucination of Multimodal Large Language Models: A Survey

Tianjun Xiao

Zheng Zhang

653

306

29 Apr 2024

Learning text-to-video retrieval from image captioning

272

26 Apr 2024

Embracing Diversity: Interpretable Zero-shot classification beyond one vector per class

226

25 Apr 2024

Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings

...

Anant Nawalgaria

Jordi Pont-Tuset

Aida Nematzadeh

EGVM

996

25 Apr 2024

FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

181

24 Apr 2024

MoDE: CLIP Data Experts via Clustering

Luke Zettlemoyer

261

24 Apr 2024

SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision

Aaron Courville

259

24 Apr 2024

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

Sachin Mehta

Maxwell Horton

Fartash Faghri

Mohammad Hossein Sekhavat

187

24 Apr 2024

Reconstructing the Image Stitching Pipeline: Integrating Fusion and Rectangling into a Unified Inpainting Model

228

23 Apr 2024

FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

Jing Shi

205

23 Apr 2024

AutoAD III: The Prequel -- Back to the Pixels

312

22 Apr 2024

Image Generative Semantic Communication with Multi-Modal Similarity Estimation for Resource-Limited Networks

276

17 Apr 2024

Vocabulary-free Image Classification and Semantic Segmentation

221

16 Apr 2024

CNN-based explanation ensembling for dataset, representation and explanations evaluation

Weronika Hryniewska-Guzik

Luca Longo

P. Biecek

FAtt

206

16 Apr 2024

Knowledge-enhanced Visual-Language Pretraining for Computational Pathology

290

15 Apr 2024

The Devil is in the Few Shots: Iterative Visual Knowledge Completion for Few-shot Learning

276

15 Apr 2024

TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning

248

14 Apr 2024

TransformerFAM: Feedback attention is working memory

420

14 Apr 2024

AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning

204

13 Apr 2024

ChimpVLM: Ethogram-Enhanced Chimpanzee Behaviour Recognition

157

13 Apr 2024

PM2: A New Prompting Multi-modal Model Paradigm for Few-shot Medical Image Classification

310

13 Apr 2024

COCONut: Modernizing COCO Segmentation

XueQing Deng

Qihang Yu

Peng Wang

Xiaohui Shen

Liang-Chieh Chen

206

12 Apr 2024

Improving Continuous Sign Language Recognition with Adapted Image Models

232

12 Apr 2024

Generalized Contrastive Learning for Multi-Modal Retrieval and Ranking

Tianyu Zhu

M. Jung

Jesse Clark

433

12 Apr 2024

Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models

International Conference on Learning Representations (ICLR), 2024

Max Argus

Thomas Brox

518

11 Apr 2024

BRAVE: Broadening the visual encoding of vision-language models

European Conference on Computer Vision (ECCV), 2024

308

10 Apr 2024

Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models

A. Sophia Koepke

191

09 Apr 2024

Test-Time Zero-Shot Temporal Action Localization

303

08 Apr 2024

Hyperbolic Learning with Synthetic Captions for Open-World Detection

220

07 Apr 2024

CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

Neural Information Processing Systems (NeurIPS), 2024

459

04 Apr 2024

No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

Neural Information Processing Systems (NeurIPS), 2024

Vishaal Udandarao

Christian Schroeder de Witt

705

04 Apr 2024

Foundation Model for Advancing Healthcare: Challenges, Opportunities, and Future Directions

IEEE Reviews in Biomedical Engineering (RBME), 2024

Hao Chen

365

04 Apr 2024