2205.01917
CoCa: Contrastive Captioners are Image-Text Foundation Models
4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
Papers citing
"CoCa: Contrastive Captioners are Image-Text Foundation Models"
50 / 910 papers shown
Title
Bridging Language Gaps in Audio-Text Retrieval
Zhiyong Yan
Heinrich Dinkel
Yongqing Wang
Jizhong Liu
Junbo Zhang
Yujun Wang
Bin Wang
VLM
27
4
0
11 Jun 2024
BrainChat: Decoding Semantic Information from fMRI using Vision-language Pretrained Models
Wanaiu Huang
18
1
0
10 Jun 2024
Gentle-CLIP: Exploring Aligned Semantic In Low-Quality Multimodal Data With Soft Alignment
Zijia Song
Z. Zang
Yelin Wang
Guozheng Yang
Jiangbin Zheng
Kaicheng Yu
Wanyu Chen
Stan Z. Li
31
0
0
09 Jun 2024
Understanding Information Storage and Transfer in Multi-modal Large Language Models
Samyadeep Basu
Martin Grayson
C. Morrison
Besmira Nushi
S. Feizi
Daniela Massiceti
18
10
0
06 Jun 2024
Low-Rank Similarity Mining for Multimodal Dataset Distillation
Yue Xu
Zhilin Lin
Yusong Qiu
Cewu Lu
Yong-Lu Li
DD
41
3
0
06 Jun 2024
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
Alex Jinpeng Wang
Linjie Li
Yiqi Lin
Min Li
Lijuan Wang
Mike Zheng Shou
VLM
18
3
0
04 Jun 2024
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
Junho Kim
Hyunjun Kim
Yeonju Kim
Yong Man Ro
MLLM
34
10
0
04 Jun 2024
Few-Shot Classification of Interactive Activities of Daily Living (InteractADL)
Zane Durante
Robathan Harries
Edward Vendrow
Zelun Luo
Yuta Kyuragi
Kazuki Kozuka
Fei-Fei Li
Ehsan Adeli
VLM
25
0
0
03 Jun 2024
ED-SAM: An Efficient Diffusion Sampling Approach to Domain Generalization in Vision-Language Foundation Models
Thanh-Dat Truong
Xin Li
Bhiksha Raj
Jackson Cothren
Khoa Luu
DiffM
VLM
33
1
0
03 Jun 2024
UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment
Hantao Zhou
Longxiang Tang
Rui Yang
Guanyi Qin
Yan Zhang
Runze Hu
Xiu Li
29
5
0
03 Jun 2024
Quantum Visual Feature Encoding Revisited
Xuan-Bac Nguyen
Hoang-Quan Nguyen
Hugh Churchill
Samee U. Khan
Khoa Luu
22
9
0
30 May 2024
QClusformer: A Quantum Transformer-based Framework for Unsupervised Visual Clustering
Xuan-Bac Nguyen
Hoang-Quan Nguyen
Samuel Yen-Chi Chen
Samee U. Khan
Hugh Churchill
Khoa Luu
21
11
0
30 May 2024
Multi-Modal Generative Embedding Model
Feipeng Ma
Hongwei Xue
Guangting Wang
Yizhou Zhou
Fengyun Rao
Shilin Yan
Yueyi Zhang
Siying Wu
Mike Zheng Shou
Xiaoyan Sun
VLM
26
3
0
29 May 2024
CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval
Xintong Jiang
Yaxiong Wang
Mengjian Li
Yujiao Wu
Bingwen Hu
Xueming Qian
CoGe
32
4
0
29 May 2024
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Vicky Zayats
Peter Chen
Melissa Ferrari
Dirk Padfield
AI4CE
30
0
0
29 May 2024
Wavelet-Based Image Tokenizer for Vision Transformers
Zhenhai Zhu
Radu Soricut
ViT
22
3
0
28 May 2024
Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
Xin Xiao
Bohong Wu
Jiacong Wang
Chunyuan Li
Xun Zhou
Haoyuan Guo
VLM
34
7
0
28 May 2024
Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model
Haogeng Liu
Quanzeng You
Xiaotian Han
Yongfei Liu
Huaibo Huang
Ran He
Hongxia Yang
26
2
0
28 May 2024
Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling
Cristian Rodriguez-Opazo
Ehsan Abbasnejad
Damien Teney
Edison Marrese-Taylor
Hamed Damirchi
A. Hengel
VLM
20
1
0
27 May 2024
A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts
Mohammed Nowaz Rabbani Chowdhury
Meng Wang
K. E. Maghraoui
Naigang Wang
Pin-Yu Chen
Christopher Carothers
MoE
24
4
0
26 May 2024
ECG Semantic Integrator (ESI): A Foundation ECG Model Pretrained with LLM-Enhanced Cardiological Text
Han Yu
Peikun Guo
Akane Sano
34
14
0
26 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
64
41
0
23 May 2024
No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models
Angeline Pouget
Lucas Beyer
Emanuele Bugliarello
Xiao Wang
Andreas Steiner
Xiao-Qi Zhai
Ibrahim M. Alabdulmohsin
VLM
31
7
0
22 May 2024
More Distinctively Black and Feminine Faces Lead to Increased Stereotyping in Vision-Language Models
Messi H.J. Lee
Jacob M. Montgomery
Calvin K Lai
VLM
29
0
0
22 May 2024
OpenCarbonEval: A Unified Carbon Emission Estimation Framework in Large-Scale AI Models
Zhaojian Yu
Yinghao Wu
Zhuotao Deng
Yansong Tang
Xiao-Ping Zhang
39
2
0
21 May 2024
Transcriptomics-guided Slide Representation Learning in Computational Pathology
Guillaume Jaume
Lukas Oldenburg
Anurag J. Vaidya
Richard J. Chen
Drew F. K. Williamson
Thomas Peeters
Andrew H. Song
Faisal Mahmood
40
21
0
19 May 2024
A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision
Charles Raude
Prajwal K R
Liliane Momeni
Hannah Bull
Samuel Albanie
Andrew Zisserman
Gül Varol
SLR
26
5
0
16 May 2024
PRISM: A Multi-Modal Generative Foundation Model for Slide-Level Histopathology
George Shaikovski
Adam Casson
Kristen Severson
Eric Zimmermann
Yi Kan Wang
...
Peter Hamilton
William A. Moye
Eugene Vorontsov
Siqi Liu
Thomas J. Fuchs
MedIm
30
22
0
16 May 2024
CLIP with Quality Captions: A Strong Pretraining for Vision Tasks
Pavan Kumar Anasosalu Vasu
Hadi Pouransari
Fartash Faghri
Oncel Tuzel
VLM
CLIP
22
6
0
14 May 2024
Efficient Vision-Language Pre-training by Cluster Masking
Zihao Wei
Zixuan Pan
Andrew Owens
VLM
21
6
0
14 May 2024
All in One Framework for Multimodal Re-identification in the Wild
He Li
Mang Ye
Ming Zhang
Bo Du
25
9
0
08 May 2024
iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval
Lorenzo Agnolucci
Alberto Baldrati
Marco Bertini
A. Bimbo
35
9
0
05 May 2024
Understanding Retrieval-Augmented Task Adaptation for Vision-Language Models
Yifei Ming
Yixuan Li
VLM
23
7
0
02 May 2024
Hallucination of Multimodal Large Language Models: A Survey
Zechen Bai
Pichao Wang
Tianjun Xiao
Tong He
Zongbo Han
Zheng Zhang
Mike Zheng Shou
VLM
LRM
80
139
0
29 Apr 2024
Learning text-to-video retrieval from image captioning
Lucas Ventura
Cordelia Schmid
Gül Varol
3DV
31
3
0
26 Apr 2024
Embracing Diversity: Interpretable Zero-shot classification beyond one vector per class
Mazda Moayeri
Michael G. Rabbat
Mark Ibrahim
Diane Bouchacourt
VLM
41
1
0
25 Apr 2024
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
Olivia Wiles
Chuhan Zhang
Isabela Albuquerque
Ivana Kajić
Su Wang
...
Jordi Pont-Tuset
Aida Nematzadeh
Anant Nawalgaria
EGVM
117
13
0
25 Apr 2024
FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication
Eric Slyman
Stefan Lee
Scott D. Cohen
Kushal Kafle
VLM
23
5
0
24 Apr 2024
MoDE: CLIP Data Experts via Clustering
Jiawei Ma
Po-Yao Huang
Saining Xie
Shang-Wen Li
Luke Zettlemoyer
Shih-Fu Chang
Wen-tau Yih
Hu Xu
MoE
CLIP
VLM
18
10
0
24 Apr 2024
SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision
Ankit Vani
Bac Nguyen
Samuel Lavoie
Ranjay Krishna
Aaron Courville
26
1
0
24 Apr 2024
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
Sachin Mehta
Maxwell Horton
Fartash Faghri
Mohammad Hossein Sekhavat
Mahyar Najibi
Mehrdad Farajtabar
Oncel Tuzel
Mohammad Rastegari
VLM
CLIP
29
6
0
24 Apr 2024
Reconstructing the Image Stitching Pipeline: Integrating Fusion and Rectangling into a Unified Inpainting Model
Ziqi Xie
Weidong Zhao
Xianhui Liu
Jian Zhao
Ning Jia
26
0
0
23 Apr 2024
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Hang Hua
Jing Shi
Kushal Kafle
Simon Jenni
Daoan Zhang
John Collomosse
Scott D. Cohen
Jiebo Luo
CoGe
VLM
42
9
0
23 Apr 2024
AutoAD III: The Prequel -- Back to the Pixels
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
36
20
0
22 Apr 2024
Image Generative Semantic Communication with Multi-Modal Similarity Estimation for Resource-Limited Networks
Eri Hosonuma
Taku Yamazaki
Takumi Miyoshi
Akihito Taya
Yuuki Nishiyama
K. Sezaki
DiffM
18
1
0
17 Apr 2024
Vocabulary-free Image Classification and Semantic Segmentation
Alessandro Conti
Enrico Fini
Massimiliano Mancini
Paolo Rota
Yiming Wang
Elisa Ricci
VLM
35
2
0
16 Apr 2024
CNN-based explanation ensembling for dataset, representation and explanations evaluation
Weronika Hryniewska-Guzik
Luca Longo
P. Biecek
FAtt
43
0
0
16 Apr 2024
Knowledge-enhanced Visual-Language Pretraining for Computational Pathology
Xiao Zhou
Xiaoman Zhang
Chaoyi Wu
Ya-Qin Zhang
Weidi Xie
Yanfeng Wang
VLM
27
6
0
15 Apr 2024
The Devil is in the Few Shots: Iterative Visual Knowledge Completion for Few-shot Learning
Yaohui Li
Qifeng Zhou
Haoxing Chen
Jianbing Zhang
Xinyu Dai
Hao Zhou
VLM
29
0
0
15 Apr 2024
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning
Quang Minh Dinh
Minh Khoi Ho
Anh Quan Dang
Hung Phong Tran
30
6
0
14 Apr 2024