2108.10904
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
24 August 2021
Zirui Wang
Jiahui Yu
Adams Wei Yu
Zihang Dai
Yulia Tsvetkov
Yuan Cao
VLM
MLLM
Papers citing
"SimVLM: Simple Visual Language Model Pretraining with Weak Supervision"
50 / 565 papers shown
Leveraging per Image-Token Consistency for Vision-Language Pre-training
Yunhao Gou
Tom Ko
Hansi Yang
James T. Kwok
Yu Zhang
Mingxuan Wang
VLM
14
9
0
20 Nov 2022
A survey on knowledge-enhanced multimodal learning
Maria Lymperaiou
Giorgos Stamou
28
13
0
19 Nov 2022
CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering
Yao Zhang
Haokun Chen
A. Frikha
Yezi Yang
Denis Krompass
Gengyuan Zhang
Jindong Gu
Volker Tresp
VLM
LRM
8
7
0
19 Nov 2022
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
Hao Li
Jinguo Zhu
Xiaohu Jiang
Xizhou Zhu
Hongsheng Li
...
Xiaohua Wang
Yu Qiao
Xiaogang Wang
Wenhai Wang
Jifeng Dai
MLLM
13
55
0
17 Nov 2022
InstructPix2Pix: Learning to Follow Image Editing Instructions
Tim Brooks
Aleksander Holynski
Alexei A. Efros
DiffM
13
1,670
0
17 Nov 2022
Progressive Tree-Structured Prototype Network for End-to-End Image Captioning
Pengpeng Zeng
Jinkuan Zhu
Jingkuan Song
Lianli Gao
VLM
12
27
0
17 Nov 2022
CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge
Linli Yao
Wei-Neng Chen
Qin Jin
VLM
14
10
0
17 Nov 2022
PromptCap: Prompt-Guided Task-Aware Image Captioning
Yushi Hu
Hang Hua
Zhengyuan Yang
Weijia Shi
Noah A. Smith
Jiebo Luo
28
101
0
15 Nov 2022
ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation
Bin Shan
Yaqian Han
Weichong Yin
Shuohuan Wang
Yu Sun
Hao Tian
Hua-Hong Wu
Haifeng Wang
MLLM
VLM
9
7
0
09 Nov 2022
Text-Only Training for Image Captioning using Noise-Injected CLIP
David Nukrai
Ron Mokady
Amir Globerson
VLM
CLIP
41
69
0
01 Nov 2022
FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning
Suvir Mirchandani
Licheng Yu
Mengjiao MJ Wang
Animesh Sinha
Wen-Jun Jiang
Tao Xiang
Ning Zhang
19
16
0
26 Oct 2022
Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering
Q. Si
Yuanxin Liu
Zheng Lin
Peng Fu
Weiping Wang
VLM
21
1
0
26 Oct 2022
Towards Unifying Reference Expression Generation and Comprehension
Duo Zheng
Tao Kong
Ya Jing
Jiaan Wang
Xiaojie Wang
ObjD
16
6
0
24 Oct 2022
Composing Ensembles of Pre-trained Models via Iterative Consensus
Shuang Li
Yilun Du
J. Tenenbaum
Antonio Torralba
Igor Mordatch
MoMe
16
23
0
20 Oct 2022
CPL: Counterfactual Prompt Learning for Vision and Language Models
Xuehai He
Diji Yang
Weixi Feng
Tsu-jui Fu
Arjun Reddy Akula
Varun Jampani
P. Narayana
Sugato Basu
William Yang Wang
X. Wang
VPVLM
VLM
36
15
0
19 Oct 2022
Non-Contrastive Learning Meets Language-Image Pre-Training
Jinghao Zhou
Li Dong
Zhe Gan
Lijuan Wang
Furu Wei
VLM
CLIP
9
25
0
17 Oct 2022
Contrastive Language-Image Pre-Training with Knowledge Graphs
Xuran Pan
Tianzhu Ye
Dongchen Han
S. Song
Gao Huang
VLM
CLIP
14
42
0
17 Oct 2022
Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training
A. M. H. Tiong
Junnan Li
Boyang Albert Li
Silvio Savarese
S. Hoi
MLLM
13
101
0
17 Oct 2022
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
Wenliang Dai
Zihan Liu
Ziwei Ji
Dan Su
Pascale Fung
MLLM
VLM
6
61
0
14 Oct 2022
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
Oscar Manas
Pau Rodríguez López
Saba Ahmadi
Aida Nematzadeh
Yash Goyal
Aishwarya Agrawal
VLM
VPVLM
8
48
0
13 Oct 2022
ImaginaryNet: Learning Object Detectors without Real Images and Annotations
Minheng Ni
Zitong Huang
Kai-Hua Feng
W. Zuo
VLM
11
15
0
13 Oct 2022
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Gregor Geigle
Chen Cecilia Liu
Jonas Pfeiffer
Iryna Gurevych
VLM
13
1
0
12 Oct 2022
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
Yatai Ji
Junjie Wang
Yuan Gong
Lin Zhang
Yan Zhu
Hongfa Wang
Jiaxing Zhang
Tetsuya Sakai
Yujiu Yang
MLLM
14
29
0
11 Oct 2022
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Shraman Pramanick
Li Jing
Sayan Nag
Jiachen Zhu
Hardik Shah
Yann LeCun
Ramalingam Chellappa
19
21
0
09 Oct 2022
Retrieval Augmented Visual Question Answering with Outside Knowledge
Weizhe Lin
Bill Byrne
RALM
74
68
0
07 Oct 2022
Visualize Before You Write: Imagination-Guided Open-Ended Text Generation
Wanrong Zhu
An Yan
Yujie Lu
Wenda Xu
X. Wang
Miguel P. Eckstein
William Yang Wang
74
37
0
07 Oct 2022
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Kenton Lee
Mandar Joshi
Iulia Turc
Hexiang Hu
Fangyu Liu
Julian Martin Eisenschlos
Urvashi Khandelwal
Peter Shaw
Ming-Wei Chang
Kristina Toutanova
CLIP
VLM
148
259
0
07 Oct 2022
A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning
Aishwarya Kamath
Peter Anderson
Su Wang
Jing Yu Koh
Alexander Ku
Austin Waters
Yinfei Yang
Jason Baldridge
Zarana Parekh
LM&Ro
15
45
0
06 Oct 2022
MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text
Wenhu Chen
Hexiang Hu
Xi Chen
Pat Verga
William W. Cohen
RALM
8
71
0
06 Oct 2022
F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models
Weicheng Kuo
Yin Cui
Xiuye Gu
A. Piergiovanni
A. Angelova
MLLM
VLM
ObjD
35
134
0
30 Sep 2022
SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation
R. Ramos
Bruno Martins
Desmond Elliott
Yova Kementchedjhieva
VLM
14
86
0
30 Sep 2022
Domain-aware Self-supervised Pre-training for Label-Efficient Meme Analysis
Shivam Sharma
Mohd Khizir Siddiqui
Md. Shad Akhtar
Tanmoy Chakraborty
SSL
20
5
0
29 Sep 2022
Paraphrasing Is All You Need for Novel Object Captioning
Cheng Yang
Yao-Hung Hubert Tsai
Wanshu Fan
Ruslan Salakhutdinov
Louis-Philippe Morency
Yu-Chiang Frank Wang
30
4
0
25 Sep 2022
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
Junke Wang
Dongdong Chen
Zuxuan Wu
Chong Luo
Luowei Zhou
Yucheng Zhao
Yujia Xie
Ce Liu
Yu-Gang Jiang
Lu Yuan
MLLM
VLM
24
148
0
15 Sep 2022
Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge
Zhihong Chen
Guanbin Li
Xiang Wan
110
65
0
15 Sep 2022
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering
Jingjing Jiang
Zi-yi Liu
Nanning Zheng
21
8
0
14 Sep 2022
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Xi Chen
Xiao Wang
Soravit Changpinyo
A. Piergiovanni
Piotr Padlewski
...
Andreas Steiner
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
MLLM
VLM
13
529
0
14 Sep 2022
Learning to Evaluate Performance of Multi-modal Semantic Localization
Zhiqiang Yuan
Wenkai Zhang
Chongyang Li
Zhaoying Pan
Yongqiang Mao
Jialiang Chen
Shuoke Li
Hongqi Wang
Xian Sun
13
20
0
14 Sep 2022
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
Hongwei Xue
Yuchong Sun
Bei Liu
Jianlong Fu
Rui Song
Houqiang Li
Jiebo Luo
CLIP
VLM
14
68
0
14 Sep 2022
PreSTU: Pre-Training for Scene-Text Understanding
Jihyung Kil
Soravit Changpinyo
Xi Chen
Hexiang Hu
Sebastian Goodman
Wei-Lun Chao
Radu Soricut
VLM
125
29
0
12 Sep 2022
MaXM: Towards Multilingual Visual Question Answering
Soravit Changpinyo
Linting Xue
Michal Yarom
Ashish V. Thapliyal
Idan Szpektor
J. Amelot
Xi Chen
Radu Soricut
23
8
0
12 Sep 2022
VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models
Felix Vogel
Nina Shvetsova
Leonid Karlinsky
Hilde Kuehne
VLM
48
7
0
12 Sep 2022
Pre-training image-language transformers for open-vocabulary tasks
A. Piergiovanni
Weicheng Kuo
A. Angelova
VLM
ViT
26
8
0
09 Sep 2022
Statistical Foundation Behind Machine Learning and Its Impact on Computer Vision
Lei Zhang
H. Shum
VLM
SSL
8
2
0
06 Sep 2022
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
Mustafa Shukor
Guillaume Couairon
Matthieu Cord
VLM
CLIP
19
26
0
29 Aug 2022
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
...
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
MLLM
VLM
ViT
11
625
0
22 Aug 2022
Revising Image-Text Retrieval via Multi-Modal Entailment
Xu Yan
Chunhui Ai
Ziqiang Cao
Min Cao
Sujian Li
Wen-Yi Chen
G. Fu
10
0
0
22 Aug 2022
Prompt Tuning for Generative Multimodal Pretrained Models
Han Yang
Junyang Lin
An Yang
Peng Wang
Chang Zhou
Hongxia Yang
VLM
LRM
VPVLM
27
30
0
04 Aug 2022
Masked Vision and Language Modeling for Multi-modal Representation Learning
Gukyeong Kwon
Zhaowei Cai
Avinash Ravichandran
Erhan Bas
Rahul Bhotika
Stefano Soatto
16
66
0
03 Aug 2022
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
Haoxuan You
Luowei Zhou
Bin Xiao
Noel Codella
Yu Cheng
Ruochen Xu
Shih-Fu Chang
Lu Yuan
CLIP
VLM
11
46
0
26 Jul 2022