2102.03334
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
5 February 2021
Wonjae Kim
Bokyung Son
Ildoo Kim
VLM
CLIP
Papers citing "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
50 / 277 papers shown
Title
Cross-Modal Contrastive Learning for Robust Reasoning in VQA
Qinjie Zheng
Chaoyue Wang
Daqing Liu
Dadong Wang
Dacheng Tao
LRM
21
0
0
21 Nov 2022
Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models
Xichen Pan
Pengda Qin
Yuhong Li
Hui Xue
Wenhu Chen
DiffM
16
62
0
20 Nov 2022
CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering
Yao Zhang
Haokun Chen
A. Frikha
Yezi Yang
Denis Krompass
Gengyuan Zhang
Jindong Gu
Volker Tresp
VLM
LRM
16
7
0
19 Nov 2022
Visual Programming: Compositional visual reasoning without training
Tanmay Gupta
Aniruddha Kembhavi
ReLM
VLM
LRM
57
400
0
18 Nov 2022
Video Unsupervised Domain Adaptation with Deep Learning: A Comprehensive Survey
Yuecong Xu
Haozhi Cao
Zhenghua Chen
Xiaoli Li
Lihua Xie
Jianfei Yang
24
14
0
17 Nov 2022
PromptCap: Prompt-Guided Task-Aware Image Captioning
Yushi Hu
Hang Hua
Zhengyuan Yang
Weijia Shi
Noah A. Smith
Jiebo Luo
38
101
0
15 Nov 2022
YORO -- Lightweight End to End Visual Grounding
Chih-Hui Ho
Srikar Appalaraju
Bhavan A. Jasani
R. Manmatha
Nuno Vasconcelos
ObjD
21
21
0
15 Nov 2022
Why Did the Chicken Cross the Road? Rephrasing and Analyzing Ambiguous Questions in VQA
Elias Stengel-Eskin
Jimena Guallar-Blasco
Yi Zhou
Benjamin Van Durme
UQLM
24
11
0
14 Nov 2022
ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation
Bin Shan
Yaqian Han
Weichong Yin
Shuohuan Wang
Yu Sun
Hao Tian
Hua-Hong Wu
Haifeng Wang
MLLM
VLM
11
7
0
09 Nov 2022
Deep Multimodal Fusion for Generalizable Person Re-identification
Suncheng Xiang
Hao Chen
Jing Gao
Jiawang Mou
Ting Liu
Dahong Qian
Yuzhuo Fu
24
5
0
02 Nov 2022
Training Vision-Language Models with Less Bimodal Supervision
Elad Segal
Ben Bogin
Jonathan Berant
VLM
19
2
0
01 Nov 2022
Masked Vision-Language Transformer in Fashion
Ge-Peng Ji
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Christos Sakaridis
Luc Van Gool
17
25
0
27 Oct 2022
Multilingual Multimodal Learning with Machine Translated Text
Chen Qiu
Dan Oneaţă
Emanuele Bugliarello
Stella Frank
Desmond Elliott
38
13
0
24 Oct 2022
Visualizing the Obvious: A Concreteness-based Ensemble Model for Noun Property Prediction
Yue Yang
Artemis Panagopoulou
Marianna Apidianaki
Mark Yatskar
Chris Callison-Burch
21
2
0
24 Oct 2022
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text
Zifeng Wang
Zhenbang Wu
Dinesh Agarwal
Jimeng Sun
CLIP
VLM
MedIm
29
394
0
18 Oct 2022
Contrastive Language-Image Pre-Training with Knowledge Graphs
Xuran Pan
Tianzhu Ye
Dongchen Han
S. Song
Gao Huang
VLM
CLIP
22
43
0
17 Oct 2022
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
Wenliang Dai
Zihan Liu
Ziwei Ji
Dan Su
Pascale Fung
MLLM
VLM
21
62
0
14 Oct 2022
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Gregor Geigle
Chen Cecilia Liu
Jonas Pfeiffer
Iryna Gurevych
VLM
26
1
0
12 Oct 2022
Contrastive Video-Language Learning with Fine-grained Frame Sampling
Zixu Wang
Yujie Zhong
Yishu Miao
Lin Ma
Lucia Specia
37
11
0
10 Oct 2022
CLIP model is an Efficient Continual Learner
Vishal G. Thengane
Salman Khan
Munawar Hayat
F. Khan
BDL
VLM
CLL
104
46
0
06 Oct 2022
Uncertainty Estimation for Multi-view Data: The Power of Seeing the Whole Picture
M. Jung
He Zhao
Joanna Dipnall
Belinda Gabbe
Lan Du
UQCV
EDL
50
12
0
06 Oct 2022
PLOT: Prompt Learning with Optimal Transport for Vision-Language Models
Guangyi Chen
Weiran Yao
Xiangchen Song
Xinyue Li
Yongming Rao
Kun Zhang
VPVLM
VLM
8
62
0
03 Oct 2022
Multimodal Analogical Reasoning over Knowledge Graphs
Ningyu Zhang
Lei Li
Xiang Chen
Xiaozhuan Liang
Shumin Deng
Huajun Chen
42
26
0
01 Oct 2022
TVLT: Textless Vision-Language Transformer
Zineng Tang
Jaemin Cho
Yixin Nie
Mohit Bansal
VLM
49
28
0
28 Sep 2022
Show, Interpret and Tell: Entity-aware Contextualised Image Captioning in Wikipedia
K. Nguyen
Ali Furkan Biten
Andrés Mafla
Lluís Gómez
Dimosthenis Karatzas
28
10
0
21 Sep 2022
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu
Swaroop Mishra
Tony Xia
Liang Qiu
Kai-Wei Chang
Song-Chun Zhu
Oyvind Tafjord
Peter Clark
A. Kalyan
ELM
ReLM
LRM
211
1,105
0
20 Sep 2022
Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings
Yiren Jian
Chongyang Gao
Soroush Vosoughi
SSL
18
15
0
20 Sep 2022
VIPHY: Probing "Visible" Physical Commonsense Knowledge
Shikhar Singh
Ehsan Qasemi
Muhao Chen
29
6
0
15 Sep 2022
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
Hongwei Xue
Yuchong Sun
Bei Liu
Jianlong Fu
Rui Song
Houqiang Li
Jiebo Luo
CLIP
VLM
25
68
0
14 Sep 2022
VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models
Felix Vogel
Nina Shvetsova
Leonid Karlinsky
Hilde Kuehne
VLM
57
7
0
12 Sep 2022
MuMUR: Multilingual Multimodal Universal Retrieval
Avinash Madasu
Estelle Aflalo
Gabriela Ben-Melech Stan
Shachar Rosenman
Shao-Yen Tseng
Gedas Bertasius
Vasudev Lal
37
3
0
24 Aug 2022
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
...
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
MLLM
VLM
ViT
49
629
0
22 Aug 2022
VLMAE: Vision-Language Masked Autoencoder
Su He
Taian Guo
Tao Dai
Ruizhi Qiao
Chen Wu
Xiujun Shu
Bohan Ren
VLM
26
11
0
19 Aug 2022
See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval
Xiujun Shu
Wei Wen
Haoqian Wu
Keyun Chen
Yi-Zhe Song
Ruizhi Qiao
Bohan Ren
Xiao Wang
19
91
0
18 Aug 2022
Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides
Dong Won Lee
Chaitanya Ahuja
Paul Pu Liang
Sanika Natu
Louis-Philippe Morency
15
7
0
17 Aug 2022
Masked Vision and Language Modeling for Multi-modal Representation Learning
Gukyeong Kwon
Zhaowei Cai
Avinash Ravichandran
Erhan Bas
Rahul Bhotika
Stefano Soatto
22
67
0
03 Aug 2022
Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics
Xiaoyuan Guo
Jiali Duan
C.-C. Jay Kuo
J. Gichoya
Imon Banerjee
VLM
14
1
0
31 Jul 2022
Visual correspondence-based explanations improve AI robustness and human-AI team accuracy
Giang Nguyen
Mohammad Reza Taesiri
Anh Totti Nguyen
20
42
0
26 Jul 2022
GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features
Van-Quang Nguyen
Masanori Suganuma
Takayuki Okatani
ViT
25
106
0
20 Jul 2022
Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer
Su He
Taian Guo
Tao Dai
Ruizhi Qiao
Bo Ren
Shutao Xia
VLM
68
49
0
05 Jul 2022
Counterfactually Measuring and Eliminating Social Bias in Vision-Language Pre-training Models
Yi Zhang
Junyan Wang
Jitao Sang
14
27
0
03 Jul 2022
LViT: Language meets Vision Transformer in Medical Image Segmentation
Zihan Li
Yunxiang Li
Qingde Li
Puyang Wang
Dazhou Guo
Le Lu
D. Jin
You Zhang
Qingqi Hong
VLM
MedIm
59
131
0
29 Jun 2022
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
Teng Wang
Wenhao Jiang
Zhichao Lu
Feng Zheng
Ran Cheng
Chengguo Yin
Ping Luo
VLM
20
43
0
17 Jun 2022
Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval
Xiao Dong
Xunlin Zhan
Yunchao Wei
Xiaoyong Wei
Yaowei Wang
Minlong Lu
Xiaochun Cao
Xiaodan Liang
19
11
0
17 Jun 2022
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
34
226
0
16 Jun 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM
ObjD
17
124
0
15 Jun 2022
Multimodal Learning with Transformers: A Survey
P. Xu
Xiatian Zhu
David A. Clifton
ViT
41
525
0
13 Jun 2022
VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
Wangchunshu Zhou
Yan Zeng
Shizhe Diao
Xinsong Zhang
CoGe
VLM
21
13
0
30 May 2022
Prompt-aligned Gradient for Prompt Tuning
Beier Zhu
Yulei Niu
Yucheng Han
Yuehua Wu
Hanwang Zhang
VLM
177
271
0
30 May 2022
HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval
Feilong Chen
Xiuyi Chen
Jiaxin Shi
Duzhen Zhang
Jianlong Chang
Qi Tian
VLM
CLIP
32
6
0
24 May 2022