1908.02265
Cited By
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Neural Information Processing Systems (NeurIPS), 2019
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"
50 / 2,232 papers shown
Detecting Hate Speech in Memes Using Multimodal Deep Learning Approaches: Prize-winning solution to Hateful Memes Challenge
Riza Velioglu
J. Rose
VLM
121
103
0
23 Dec 2020
Training data-efficient image transformers & distillation through attention
International Conference on Machine Learning (ICML), 2020
Hugo Touvron
Matthieu Cord
Matthijs Douze
Francisco Massa
Alexandre Sablayrolles
Edouard Grave
ViT
649
8,277
0
23 Dec 2020
A Multimodal Framework for the Detection of Hateful Memes
Phillip Lippe
Nithin Holla
Shantanu Chandra
S. Rajamanickam
Georgios Antoniou
Ekaterina Shutova
H. Yannakoudakis
285
91
0
23 Dec 2020
Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks
Letitia Parcalabescu
Albert Gatt
Anette Frank
Iacer Calixto
LRM
333
50
0
22 Dec 2020
ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces
AAAI Conference on Artificial Intelligence (AAAI), 2020
Zecheng He
Srinivas Sunkara
Xiaoxue Zang
Ying Xu
Lijuan Liu
Nevan Wichers
Gabriel Schubiner
Ruby B. Lee
Jindong Chen
Blaise Agüera y Arcas
261
88
0
22 Dec 2020
Object-Centric Diagnosis of Visual Reasoning
Jianwei Yang
Jiayuan Mao
Jiajun Wu
Devi Parikh
David D. Cox
J. Tenenbaum
Chuang Gan
OCL
193
17
0
21 Dec 2020
KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA
Computer Vision and Pattern Recognition (CVPR), 2020
Kenneth Marino
Xinlei Chen
Devi Parikh
Abhinav Gupta
Marcus Rohrbach
272
225
0
20 Dec 2020
Transformer Interpretability Beyond Attention Visualization
Computer Vision and Pattern Recognition (CVPR), 2020
Hila Chefer
Shir Gur
Lior Wolf
421
864
0
17 Dec 2020
MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification
AAAI Conference on Artificial Intelligence (AAAI), 2020
Te-Lin Wu
Shikhar Singh
S. Paul
Gully A. Burns
Nanyun Peng
114
21
0
16 Dec 2020
ReINTEL: A Multimodal Data Challenge for Responsible Information Identification on Social Network Sites
Duc-Trong Le
Xuan-Son Vu
Nhu-Dung To
Huu Nguyen
Thuy-Trinh Nguyen
...
A. Nguyen
Minh-Duc Hoang
Nghia T. Le
Huyen Thi Minh Nguyen
Hoang D. Nguyen
175
15
0
16 Dec 2020
A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
Linjie Li
Zhe Gan
Jingjing Liu
VLM
263
50
0
15 Dec 2020
Attention over learned object embeddings enables complex visual reasoning
Neural Information Processing Systems (NeurIPS), 2020
David Ding
Felix Hill
Adam Santoro
Malcolm Reynolds
M. Botvinick
OCL
366
78
0
15 Dec 2020
Vilio: State-of-the-art Visio-Linguistic Models applied to Hateful Memes
Niklas Muennighoff
155
73
0
14 Dec 2020
KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning
Knowledge-Based Systems (KBS), 2020
Dandan Song
S. Ma
Zhanchen Sun
Sicheng Yang
L. Liao
SSL
LRM
256
42
0
13 Dec 2020
MiniVLM: A Smaller and Faster Vision-Language Model
Jianfeng Wang
Xiaowei Hu
Pengchuan Zhang
Xiujun Li
Lijuan Wang
Guang Dai
Jianfeng Gao
Zicheng Liu
VLM
MLLM
235
70
0
13 Dec 2020
Look Before you Speak: Visually Contextualized Utterances
Computer Vision and Pattern Recognition (CVPR), 2020
Paul Hongsuck Seo
Arsha Nagrani
Cordelia Schmid
312
71
0
10 Dec 2020
Topological Planning with Transformers for Vision-and-Language Navigation
Computer Vision and Pattern Recognition (CVPR), 2020
Kevin Chen
Junshen K. Chen
Jo Chuang
Hao-Tien Lewis Chiang
Silvio Savarese
LM&Ro
218
137
0
09 Dec 2020
Hateful Memes Detection via Complementary Visual and Linguistic Networks
W. Zhang
Guihua Liu
Zhuohua Li
Fuqing Zhu
104
21
0
09 Dec 2020
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
Zhengyuan Yang
Yijuan Lu
Jianfeng Wang
Xi Yin
D. Florêncio
Lijuan Wang
Cha Zhang
Lei Zhang
Jiebo Luo
VLM
266
158
0
08 Dec 2020
Parameter Efficient Multimodal Transformers for Video Representation Learning
Sangho Lee
Youngjae Yu
Gunhee Kim
Thomas Breuel
Jan Kautz
Yale Song
ViT
275
89
0
08 Dec 2020
Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation
Jeff Da
Maxwell Forbes
Rowan Zellers
Anthony Zheng
Jena D. Hwang
Antoine Bosselut
Yejin Choi
DiffM
208
5
0
08 Dec 2020
WeaQA: Weak Supervision via Captions for Visual Question Answering
Findings (Findings), 2020
Pratyay Banerjee
Tejas Gokhale
Yezhou Yang
Chitta Baral
335
38
0
04 Dec 2020
Understanding Guided Image Captioning Performance across Domains
Conference on Computational Natural Language Learning (CoNLL), 2020
Edwin G. Ng
Bo Pang
P. Sharma
Radu Soricut
371
28
0
04 Dec 2020
Classification of Multimodal Hate Speech -- The Winning Solution of Hateful Memes Challenge
Xiayu Zhong
150
16
0
02 Dec 2020
Open-Ended Multi-Modal Relational Reasoning for Video Question Answering
IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 2020
Haozheng Luo
Ruiyang Qin
Chenwei Xu
Guo Ye
Zening Luo
469
5
0
01 Dec 2020
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs
Transactions of the Association for Computational Linguistics (TACL), 2020
Emanuele Bugliarello
Robert Bamler
Naoaki Okazaki
Desmond Elliott
251
125
0
30 Nov 2020
Point and Ask: Incorporating Pointing into Visual Question Answering
Arjun Mani
Nobline Yoo
William Fu-Hinthorn
Olga Russakovsky
3DPC
385
42
0
27 Nov 2020
Learning from Lexical Perturbations for Consistent Visual Question Answering
Spencer Whitehead
Hui Wu
Yi R. Fung
Heng Ji
Rogerio Feris
Kate Saenko
153
11
0
26 Nov 2020
A Recurrent Vision-and-Language BERT for Navigation
Computer Vision and Pattern Recognition (CVPR), 2020
Yicong Hong
Qi Wu
Yuankai Qi
Cristian Rodriguez-Opazo
Stephen Gould
LM&Ro
326
382
0
26 Nov 2020
Multimodal Learning for Hateful Memes Detection
Yi Zhou
Zhenhao Chen
312
73
0
25 Nov 2020
Open-Vocabulary Object Detection Using Captions
Computer Vision and Pattern Recognition (CVPR), 2020
Alireza Zareian
Kevin Dela Rosa
Derek Hao Hu
Shih-Fu Chang
VLM
ObjD
433
538
0
20 Nov 2020
EasyTransfer -- A Simple and Scalable Deep Transfer Learning Platform for NLP Applications
International Conference on Information and Knowledge Management (CIKM), 2020
Minghui Qiu
Peng Li
Chengyu Wang
Hanjie Pan
Yaliang Li
...
Jun Yang
Yaliang Li
Yanjie Liang
Deng Cai
Jialin Li
VLM
SyDa
362
20
0
18 Nov 2020
A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus
Bowen Zhang
Hexiang Hu
Joonseok Lee
Mingde Zhao
Sheide Chammas
Vihan Jain
Eugene Ie
Fei Sha
204
39
0
18 Nov 2020
Generating Natural Questions from Images for Multimodal Assistants
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020
Alkesh Patel
Sudarshan Ramanujam
Hadas Kotek
Christopher Klein
Jason D. Williams
VGen
184
10
0
17 Nov 2020
Improving Calibration in Deep Metric Learning With Cross-Example Softmax
Andreas Veit
Kimberly Wilber
72
3
0
17 Nov 2020
iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering
Vasu Sharma
Gurneet Arora
Navpreet Kaloty
201
39
0
16 Nov 2020
ActBERT: Learning Global-Local Video-Text Representations
Computer Vision and Pattern Recognition (CVPR), 2020
Linchao Zhu
Yi Yang
ViT
327
452
0
14 Nov 2020
Multimodal Pretraining for Dense Video Captioning
Gabriel Huang
Bo Pang
Zhenhai Zhu
Clara E. Rivera
Radu Soricut
181
101
0
10 Nov 2020
Human-centric Spatio-Temporal Video Grounding With Visual Transformers
Zongheng Tang
Yue Liao
Si Liu
Guanbin Li
Xiaojie Jin
Hongxu Jiang
Qian Yu
Dong Xu
217
127
0
10 Nov 2020
Long Range Arena: A Benchmark for Efficient Transformers
Yi Tay
Mostafa Dehghani
Samira Abnar
Songlin Yang
Dara Bahri
Philip Pham
J. Rao
Liu Yang
Sebastian Ruder
Donald Metzler
383
832
0
08 Nov 2020
Training Transformers for Information Security Tasks: A Case Study on Malicious URL Prediction
Ethan M. Rudd
Ahmed Abdallah
133
7
0
05 Nov 2020
Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings
Yue Wang
Jing Li
Michael R. Lyu
Irwin King
243
21
0
03 Nov 2020
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Neural Information Processing Systems (NeurIPS), 2020
Simon Ging
Mohammadreza Zolfaghari
Hamed Pirsiavash
Thomas Brox
ViT
CLIP
204
178
0
01 Nov 2020
Leveraging Visual Question Answering to Improve Text-to-Image Synthesis
Stanislav Frolov
Shailza Jolly
Jörn Hees
Andreas Dengel
EGVM
134
6
0
28 Oct 2020
Co-attentional Transformers for Story-Based Video Understanding
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020
Björn Bebensee
Byoung-Tak Zhang
137
7
0
27 Oct 2020
MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering
Findings (Findings), 2020
Aisha Urooj Khan
Amir Mazaheri
N. Lobo
M. Shah
215
62
0
27 Oct 2020
Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions
Radhika Dua
Sai Srinivas Kancheti
V. Balasubramanian
LRM
266
27
0
24 Oct 2020
Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions
Liunian Harold Li
Haoxuan You
Zhecan Wang
Alireza Zareian
Shih-Fu Chang
Kai-Wei Chang
SSL
VLM
201
12
0
24 Oct 2020
Multilingual Speech Translation with Efficient Finetuning of Pretrained Models
Xian Li
Changhan Wang
Yun Tang
C. Tran
Yuqing Tang
J. Pino
Alexei Baevski
Alexis Conneau
Michael Auli
284
6
0
24 Oct 2020
Can images help recognize entities? A study of the role of images for Multimodal NER
Shuguang Chen
Gustavo Aguilar
Leonardo Neves
Thamar Solorio
EgoV
269
45
0
23 Oct 2020