2205.01917
CoCa: Contrastive Captioners are Image-Text Foundation Models
4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
Papers citing
"CoCa: Contrastive Captioners are Image-Text Foundation Models"
50 / 1,041 papers shown
Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding
Computer Vision and Pattern Recognition (CVPR), 2023
Hoang-Quan Nguyen
Thanh-Dat Truong
Xuan-Bac Nguyen
Ashley Dowling
Pawan Sinha
Khoa Luu
VLM
317
29
0
26 Nov 2023
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
A. Blattmann
Tim Dockhorn
Sumith Kulal
Daniel Mendelevitch
Maciej Kilian
...
Zion English
Vikram S. Voleti
Adam Letts
Varun Jampani
Robin Rombach
VGen
885
1,892
0
25 Nov 2023
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
Ruyang Liu
Jingjia Huang
Wei-Nan Gao
Thomas H. Li
Ge Li
VLM
241
4
0
25 Nov 2023
Effective Backdoor Mitigation in Vision-Language Models Depends on the Pre-training Objective
Sahil Verma
Gantavya Bhatt
Avi Schwarzschild
Soumye Singhal
Arnav M. Das
Chirag Shah
John P Dickerson
Jeff Bilmes
AAML
242
1
0
25 Nov 2023
SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation
European Conference on Computer Vision (ECCV), 2023
Lingchen Meng
Shiyi Lan
Hengduo Li
Jose M. Alvarez
Zuxuan Wu
Yu-Gang Jiang
VLM
ISeg
MLLM
225
14
0
24 Nov 2023
T-Rex: Counting by Visual Prompting
Qing Jiang
Feng Li
Tianhe Ren
Shilong Liu
Zhaoyang Zeng
Kent Yu
Lei Zhang
185
20
0
22 Nov 2023
Vamos: Versatile Action Models for Video Understanding
European Conference on Computer Vision (ECCV), 2023
Shijie Wang
Qi Zhao
Minh Quan Do
Nakul Agarwal
Kwonjoon Lee
Chen Sun
334
35
0
22 Nov 2023
Breathing Life Into Sketches Using Text-to-Video Priors
Computer Vision and Pattern Recognition (CVPR), 2023
Rinon Gal
Yael Vinker
Yuval Alaluf
Amit H. Bermano
Daniel Cohen-Or
Ariel Shamir
Gal Chechik
VGen
DiffM
166
47
0
21 Nov 2023
Controlling the Output of a Generative Model by Latent Feature Vector Shifting
Róbert Belanec
Peter Lacko
Kristína Malinovská
145
3
0
15 Nov 2023
Violet: A Vision-Language Model for Arabic Image Captioning with Gemini Decoder
Abdelrahman Mohamed
Fakhraddin Alwajih
El Moatez Billah Nagoudi
Alcides Alcoba Inciarte
Muhammad Abdul-Mageed
VLM
MLLM
147
12
0
15 Nov 2023
Correlation-Guided Query-Dependency Calibration for Video Temporal Grounding
WonJun Moon
Sangeek Hyun
Subeen Lee
Jae-Pil Heo
353
12
0
15 Nov 2023
Towards Open-Ended Visual Recognition with Large Language Model
Qihang Yu
Xiaohui Shen
Liang-Chieh Chen
VLM
206
8
0
14 Nov 2023
DRUformer: Enhancing the driving scene Important object detection with driving relationship self-understanding
Yingjie Niu
Ming Ding
Keisuke Fujii
Kento Ohtani
Alexander Carballo
K. Takeda
ViT
176
0
0
11 Nov 2023
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Computer Vision and Pattern Recognition (CVPR), 2023
Bin Xiao
Haiping Wu
Weijian Xu
Xiyang Dai
Houdong Hu
Yumao Lu
Michael Zeng
Ce Liu
Lu Yuan
VLM
325
362
0
10 Nov 2023
LRM: Large Reconstruction Model for Single Image to 3D
Yicong Hong
Kai Zhang
Jiuxiang Gu
Sai Bi
Yang Zhou
Difan Liu
Feng Liu
Kalyan Sunkavalli
Trung Bui
Hao Tan
3DV
3DH
465
662
0
08 Nov 2023
OmniVec: Learning robust representations with cross modal sharing
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Siddharth Srivastava
Gaurav Sharma
SSL
260
81
0
07 Nov 2023
Meta-Adapter: An Online Few-shot Learner for Vision-Language Model
Neural Information Processing Systems (NeurIPS), 2023
Cheng Cheng
Lin Song
Ruoyi Xue
Hang Wang
Hongbin Sun
Yixiao Ge
Ying Shan
VLM
ObjD
383
45
0
07 Nov 2023
GLaMM: Pixel Grounding Large Multimodal Model
Computer Vision and Pattern Recognition (CVPR), 2023
H. Rasheed
Muhammad Maaz
Sahal Shaji Mullappilly
Abdelrahman M. Shaker
Salman Khan
Hisham Cholakkal
Rao M. Anwer
Eric Xing
Ming-Hsuan Yang
Fahad S. Khan
MLLM
VLM
413
384
0
06 Nov 2023
CogVLM: Visual Expert for Pretrained Language Models
Neural Information Processing Systems (NeurIPS), 2023
Weihan Wang
Qingsong Lv
Wenmeng Yu
Wenyi Hong
Ji Qi
...
Bin Xu
Juanzi Li
Yuxiao Dong
Ming Ding
Jie Tang
VLM
MLLM
599
699
0
06 Nov 2023
Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models
Neural Information Processing Systems (NeurIPS), 2023
Andy Zhou
Jindong Wang
Yu-Xiong Wang
Haohan Wang
VLM
216
8
0
02 Nov 2023
Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Xueting Hu
Ce Zhang
Yi Zhang
Bowen Hai
Ke Yu
Zhihai He
MDE
VLM
241
21
0
02 Nov 2023
RoboVQA: Multimodal Long-Horizon Reasoning for Robotics
IEEE International Conference on Robotics and Automation (ICRA), 2023
P. Sermanet
Tianli Ding
Jeffrey Zhao
Fei Xia
Debidatta Dwibedi
...
Pannag R Sanketi
Karol Hausman
Izhak Shafran
Brian Ichter
Yuan Cao
LM&Ro
227
96
0
01 Nov 2023
De-Diffusion Makes Text a Strong Cross-Modal Interface
Computer Vision and Pattern Recognition (CVPR), 2023
Chen Wei
Chenxi Liu
Siyuan Qiao
Zhishuai Zhang
Alan Yuille
Jiahui Yu
VLM
DiffM
247
17
0
01 Nov 2023
CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders
Neural Information Processing Systems (NeurIPS), 2023
A. Fuller
K. Millard
James R. Green
239
126
0
01 Nov 2023
fMRI-PTE: A Large-scale fMRI Pretrained Transformer Encoder for Multi-Subject Brain Activity Decoding
Xuelin Qian
Yun Wang
Jingyang Huo
Jianfeng Feng
Yanwei Fu
MedIm
128
14
0
01 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Information Fusion (Inf. Fusion), 2023
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
369
69
0
01 Nov 2023
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks
Neural Information Processing Systems (NeurIPS), 2023
Micah Goldblum
Hossein Souri
Renkun Ni
Manli Shu
Viraj Prabhu
...
Adrien Bardes
Judy Hoffman
Ramalingam Chellappa
Andrew Gordon Wilson
Tom Goldstein
VLM
432
91
0
30 Oct 2023
What's "up" with vision-language models? Investigating their struggle with spatial reasoning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Amita Kamath
Jack Hessel
Kai-Wei Chang
LRM
CoGe
342
200
0
30 Oct 2023
Intra-Modal Proxy Learning for Zero-Shot Visual Categorization with CLIP
Neural Information Processing Systems (NeurIPS), 2023
Qi Qian
Yuanhong Xu
Juhua Hu
VLM
CLIP
249
26
0
30 Oct 2023
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li
Kunchang Li
Yinan He
Yi Wang
Yali Wang
Limin Wang
Yu Qiao
Ping Luo
CLIP
VLM
VGen
318
3
0
30 Oct 2023
Foundation Models for Generalist Geospatial Artificial Intelligence
Johannes Jakubik
Sujit Roy
C. Phillips
P. Fraccaro
Denys Godwin
...
Hamed Alemohammad
M. Maskey
R. Ganti
Kommy Weldemariam
Rahul Ramachandran
AI4CE
VLM
317
152
0
28 Oct 2023
CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection
Neural Information Processing Systems (NeurIPS), 2023
Chuofan Ma
Yi Jiang
Xin Wen
Zehuan Yuan
Xiaojuan Qi
ObjD
VLM
218
66
0
25 Oct 2023
MotionAGFormer: Enhancing 3D Human Pose Estimation with a Transformer-GCNFormer Network
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Soroush Mehraban
Vida Adeli
Babak Taati
ViT
294
105
0
25 Oct 2023
Leveraging Image-Text Similarity and Caption Modification for the DataComp Challenge: Filtering Track and BYOD Track
Shuhei Yokoo
Peifei Zhu
Yuchi Ishikawa
Mikihiro Tanaka
Masayoshi Kondo
Hirokatsu Kataoka
69
1
0
23 Oct 2023
CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement
Mohammadreza Salehi
Mehrdad Farajtabar
Maxwell Horton
Fartash Faghri
Hadi Pouransari
Raviteja Vemulapalli
Oncel Tuzel
Ali Farhadi
Mohammad Rastegari
Sachin Mehta
CLIP
VLM
187
3
0
21 Oct 2023
SILC: Improving Vision Language Pretraining with Self-Distillation
Muhammad Ferjad Naeem
Yongqin Xian
Xiaohua Zhai
Lukas Hoyer
Luc Van Gool
F. Tombari
VLM
242
55
0
20 Oct 2023
CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition
K. A. Noriy
Xiaosong Yang
Marcin Budka
Jian Jun Zhang
VLM
250
5
0
18 Oct 2023
RoboLLM: Robotic Vision Tasks Grounded on Multimodal Large Language Models
IEEE International Conference on Robotics and Automation (ICRA), 2023
Zijun Long
George Killick
R. McCreadie
Gerardo Aragon Camarasa
VLM
233
23
0
16 Oct 2023
Few-shot Action Recognition with Captioning Foundation Models
Xiang Wang
Shiwei Zhang
Hangjie Yuan
Yingya Zhang
Changxin Gao
Deli Zhao
Nong Sang
VLM
306
9
0
16 Oct 2023
CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes
Yulei Qin
Xingyu Chen
Chunjiang Ge
Chaoyou Fu
Yun Gu
Ke Li
Xing Sun
Rongrong Ji
224
3
0
15 Oct 2023
Vision-by-Language for Training-Free Compositional Image Retrieval
Shyamgopal Karthik
Karsten Roth
Goran Frehse
Zeynep Akata
CoGe
321
86
0
13 Oct 2023
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
Xi Chen
Xiao Wang
Lucas Beyer
Alexander Kolesnikov
Jialin Wu
...
Keran Rong
Tianli Yu
Daniel Keysers
Xiaohua Zhai
Radu Soricut
MLLM
VLM
279
138
0
13 Oct 2023
Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models
International Conference on Learning Representations (ICLR), 2023
Vishaal Udandarao
Max F. Burg
Samuel Albanie
Matthias Bethge
VLM
291
11
0
12 Oct 2023
Generalized Logit Adjustment: Calibrating Fine-tuned Models by Removing Label Bias in Foundation Models
Neural Information Processing Systems (NeurIPS), 2023
Beier Zhu
Kaihua Tang
Qianru Sun
Hanwang Zhang
230
32
0
12 Oct 2023
Incorporating Domain Knowledge Graph into Multimodal Movie Genre Classification with Self-Supervised Attention and Contrastive Learning
ACM Multimedia (ACM MM), 2023
Jiaqi Li
Guilin Qi
Chuanyi Zhang
Yongrui Chen
Yiming Tan
Chenlong Xia
Ye Tian
173
6
0
12 Oct 2023
PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm
Haoyi Zhu
Honghui Yang
Xiaoyang Wu
Di Huang
Sha Zhang
...
Hengshuang Zhao
Chunhua Shen
Yu Qiao
Tong He
Wanli Ouyang
SSL
517
54
0
12 Oct 2023
VeCLIP: Improving CLIP Training via Visual-enriched Captions
European Conference on Computer Vision (ECCV), 2023
Zhengfeng Lai
Haotian Zhang
Bowen Zhang
Wentao Wu
Haoping Bai
...
Zhe Gan
Jiulong Shan
Chen-Nee Chuah
Yinfei Yang
Meng Cao
CLIP
VLM
306
57
0
11 Oct 2023
Lightweight In-Context Tuning for Multimodal Unified Models
Yixin Chen
Shuai Zhang
Boran Han
Jiaya Jia
128
5
0
08 Oct 2023
Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling
Haogeng Liu
Qihang Fan
Tingkai Liu
Linjie Yang
Yunzhe Tao
Huaibo Huang
Ran He
Hongxia Yang
VGen
247
15
0
08 Oct 2023
Module-wise Adaptive Distillation for Multimodality Foundation Models
Neural Information Processing Systems (NeurIPS), 2023
Chen Liang
Jiahui Yu
Ming-Hsuan Yang
Matthew A. Brown
Huayu Chen
Tuo Zhao
Boqing Gong
Tianyi Zhou
170
12
0
06 Oct 2023