arXiv: 1908.03557
VisualBERT: A Simple and Performant Baseline for Vision and Language
9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
VLM
Papers citing
"VisualBERT: A Simple and Performant Baseline for Vision and Language"
50 / 1,260 papers shown
Diagnosing Vision-and-Language Navigation: What Really Matters
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Wanrong Zhu
Yuankai Qi
P. Narayana
Kazoo Sone
Sugato Basu
Xinze Wang
Qi Wu
Miguel P. Eckstein
Wenjie Wang
LM&Ro
233
55
0
30 Mar 2021
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
Computer Vision and Pattern Recognition (CVPR), 2021
Antoine Miech
Jean-Baptiste Alayrac
Ivan Laptev
Josef Sivic
Andrew Zisserman
ViT
328
159
0
30 Mar 2021
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
Computer Vision and Pattern Recognition (CVPR), 2021
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Linbo Jin
Ben Chen
Hao Zhou
Minghui Qiu
Ling Shao
VLM
347
134
0
30 Mar 2021
Self-supervised Image-text Pre-training With Mixed Data In Chest X-rays
Xiaosong Wang
Ziyue Xu
Leo K. Tam
Dong Yang
Daguang Xu
ViT
MedIm
134
25
0
30 Mar 2021
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers
IEEE International Conference on Computer Vision (ICCV), 2021
Hila Chefer
Shir Gur
Lior Wolf
ViT
354
408
0
29 Mar 2021
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
IEEE International Conference on Computer Vision (ICCV), 2021
Pengchuan Zhang
Xiyang Dai
Jianwei Yang
Bin Xiao
Lu Yuan
Lei Zhang
Jianfeng Gao
ViT
302
373
0
29 Mar 2021
HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
IEEE International Conference on Computer Vision (ICCV), 2021
Song Liu
Haoqi Fan
Shengsheng Qian
Yiru Chen
Wenkui Ding
Zhongyuan Wang
339
165
0
28 Mar 2021
Generating and Evaluating Explanations of Attended and Error-Inducing Input Regions for VQA Models
Applied AI Letters (AA), 2021
Arijit Ray
Michael Cogswell
Xiaoyu Lin
Kamran Alipour
Ajay Divakaran
Yi Yao
Giedrius Burachas
FAtt
153
5
0
26 Mar 2021
Multi-Modal Answer Validation for Knowledge-Based VQA
AAAI Conference on Artificial Intelligence (AAAI), 2021
Jialin Wu
Jiasen Lu
Ashish Sabharwal
Roozbeh Mottaghi
377
167
0
23 Mar 2021
Instance-level Image Retrieval using Reranking Transformers
IEEE International Conference on Computer Vision (ICCV), 2021
Fuwen Tan
Jiangbo Yuan
Vicente Ordonez
ViT
354
107
0
22 Mar 2021
MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation
IEEE International Conference on Robotics and Automation (ICRA), 2021
Zachary Seymour
Kowshik Thopalli
Niluthpol Chowdhury Mithun
Han-Pang Chiu
S. Samarasekera
Rakesh Kumar
3DPC
146
20
0
21 Mar 2021
Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning
IEEE International Conference on Computer Vision (ICCV), 2021
Mandela Patrick
Yuki M. Asano
Bernie Huang
Ishan Misra
Florian Metze
Joao Henriques
Andrea Vedaldi
AI4TS
278
36
0
18 Mar 2021
Few-Shot Visual Grounding for Natural Human-Robot Interaction
Georgios Tziafas
S. Kasaei
195
7
0
17 Mar 2021
Multimodal End-to-End Sparse Model for Emotion Recognition
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Wenliang Dai
Samuel Cahyawijaya
Zihan Liu
Pascale Fung
CVBM
230
100
0
17 Mar 2021
Predicting Opioid Use Disorder from Longitudinal Healthcare Data using Multi-stream Transformer
American Medical Informatics Association Annual Symposium (AMIA), 2021
S. Fouladvand
J. Talbert
L. Dwoskin
H. Bush
A. Meadows
Lars E. Peterson
Ramakanth Kavuluru
Jin Chen
200
5
0
16 Mar 2021
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Siqi Sun
Yen-Chun Chen
Linjie Li
Shuohang Wang
Yuwei Fang
Jingjing Liu
VLM
199
89
0
16 Mar 2021
A Survey on Multimodal Disinformation Detection
International Conference on Computational Linguistics (COLING), 2021
Firoj Alam
S. Cresci
Tanmoy Chakraborty
Fabrizio Silvestri
Dimiter Dimitrov
Giovanni Da San Martino
Shaden Shaar
Hamed Firooz
Preslav Nakov
257
116
0
13 Mar 2021
Unified Pre-training for Program Understanding and Generation
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Wasi Uddin Ahmad
Saikat Chakraborty
Baishakhi Ray
Kai-Wei Chang
417
851
0
10 Mar 2021
Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision
International Journal of Computer Vision (IJCV), 2021
Andrew Shin
Masato Ishii
T. Narihira
289
49
0
06 Mar 2021
Causal Attention for Vision-Language Tasks
Computer Vision and Pattern Recognition (CVPR), 2021
Xu Yang
Hanwang Zhang
Guojun Qi
Jianfei Cai
CML
224
193
0
05 Mar 2021
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2021
Krishna Srinivasan
K. Raman
Jiecao Chen
Michael Bendersky
Marc Najork
VLM
529
390
0
02 Mar 2021
M6: A Chinese Multimodal Pretrainer
Junyang Lin
Rui Men
An Yang
Chan Zhou
Ming Ding
...
Yong Li
Jialin Li
Jingren Zhou
J. Tang
Hongxia Yang
VLM
MoE
345
147
0
01 Mar 2021
Detecting Harmful Content On Online Platforms: What Platforms Need Vs. Where Research Efforts Go
ACM Computing Surveys (CSUR), 2021
Arnav Arora
Preslav Nakov
Momchil Hardalov
Sheikh Muhammad Sarwar
Vibha Nayak
...
Dimitrina Zlatkova
Kyle Dent
Ameya Bhatawdekar
Guillaume Bouchard
Isabelle Augenstein
264
71
0
27 Feb 2021
UniT: Multimodal Multitask Learning with a Unified Transformer
IEEE International Conference on Computer Vision (ICCV), 2021
Ronghang Hu
Amanpreet Singh
ViT
358
343
0
22 Feb 2021
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
IEEE International Conference on Document Analysis and Recognition (ICDAR), 2021
Rafal Powalski
Łukasz Borchmann
Dawid Jurkiewicz
Tomasz Dwojak
Michal Pietruszka
Gabriela Pałka
ViT
356
184
0
18 Feb 2021
Hierarchical Similarity Learning for Language-based Product Image Retrieval
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021
Zhe Ma
Fenghao Liu
Jianfeng Dong
Xiaoye Qu
Yuan He
S. Ji
VLM
154
7
0
18 Feb 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Computer Vision and Pattern Recognition (CVPR), 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
1.1K
1,360
0
17 Feb 2021
LambdaNetworks: Modeling Long-Range Interactions Without Attention
International Conference on Learning Representations (ICLR), 2021
Irwan Bello
509
187
0
17 Feb 2021
Biomedical Question Answering: A Survey of Approaches and Challenges
ACM Computing Surveys (CSUR), 2021
Qiao Jin
Zheng Yuan
Guangzhi Xiong
Qian Yu
Huaiyuan Ying
Chuanqi Tan
Mosha Chen
Songfang Huang
Xiaozhong Liu
Sheng Yu
250
123
0
10 Feb 2021
Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
Linwei Ye
Mrigank Rochan
Zhi Liu
Xiaoqin Zhang
Yang Wang
VOS
EgoV
131
65
0
09 Feb 2021
CSS-LM: A Contrastive Framework for Semi-supervised Fine-tuning of Pre-trained Language Models
IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2021
Yusheng Su
Xu Han
Yankai Lin
Zhengyan Zhang
Zhiyuan Liu
Peng Li
Jie Zhou
Maosong Sun
176
12
0
07 Feb 2021
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
International Conference on Machine Learning (ICML), 2021
Wonjae Kim
Bokyung Son
Ildoo Kim
VLM
CLIP
547
2,107
0
05 Feb 2021
RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER
AAAI Conference on Artificial Intelligence (AAAI), 2021
Lin Sun
Jiquan Wang
Kai Zhang
Yindu Su
Fangsheng Weng
159
172
0
05 Feb 2021
Inferring spatial relations from textual descriptions of images
Pattern Recognition (Pattern Recogn.), 2021
A. Elu
Gorka Azkune
Oier López de Lacalle
Ignacio Arganda-Carreras
Aitor Soroa Etxabe
Eneko Agirre
139
2
0
01 Feb 2021
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Transactions of the Association for Computational Linguistics (TACL), 2021
Lisa Anne Hendricks
John F. J. Mellor
R. Schneider
Jean-Baptiste Alayrac
Aida Nematzadeh
238
126
0
31 Jan 2021
An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021
Alessandro Suglia
Yonatan Bisk
Ioannis Konstas
Antonio Vergari
E. Bastianelli
Andrea Vanzo
Oliver Lemon
138
8
0
31 Jan 2021
Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
AAAI Conference on Artificial Intelligence (AAAI), 2021
Yehao Li
Yingwei Pan
Ting Yao
Jingwen Chen
Tao Mei
VLM
157
58
0
27 Jan 2021
Adversarial Text-to-Image Synthesis: A Review
Neural Networks (NN), 2021
Stanislav Frolov
Tobias Hinz
Federico Raue
Jörn Hees
Andreas Dengel
EGVM
322
201
0
25 Jan 2021
Latent Variable Models for Visual Question Answering
Zixu Wang
Yishu Miao
Lucia Specia
237
5
0
16 Jan 2021
Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge
Violetta Shevchenko
Damien Teney
A. Dick
Anton Van Den Hengel
213
31
0
15 Jan 2021
Latent Alignment of Procedural Concepts in Multimodal Recipes
Hossein Rajaby Faghihi
Roshanak Mirzaee
Sudarshan Paliwal
Parisa Kordjamshidi
112
3
0
12 Jan 2021
MSD: Saliency-aware Knowledge Distillation for Multimodal Understanding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Woojeong Jin
Maziar Sanjabi
Shaoliang Nie
L Tan
Xiang Ren
Hamed Firooz
163
6
0
06 Jan 2021
Transformers in Vision: A Survey
ACM Computing Surveys (CSUR), 2021
Salman Khan
Muzammal Naseer
Munawar Hayat
Syed Waqas Zamir
Fahad Shahbaz Khan
M. Shah
ViT
924
3,176
0
04 Jan 2021
VinVL: Revisiting Visual Representations in Vision-Language Models
Pengchuan Zhang
Xiujun Li
Xiaowei Hu
Jianwei Yang
Lei Zhang
Lijuan Wang
Yejin Choi
Jianfeng Gao
ObjD
VLM
513
168
0
02 Jan 2021
VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Xiaopeng Lu
Tiancheng Zhao
Kyusong Lee
268
29
0
01 Jan 2021
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
795
406
0
31 Dec 2020
OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts
Yuxian Meng
Shuhe Wang
Qinghong Han
Xiaofei Sun
Leilei Gan
Rui Yan
Jiwei Li
371
31
0
30 Dec 2020
Detecting Hate Speech in Multi-modal Memes
Abhishek Das
Japsimar Singh Wahi
Siyao Li
136
75
0
29 Dec 2020
Detecting Hate Speech in Memes Using Multimodal Deep Learning Approaches: Prize-winning solution to Hateful Memes Challenge
Riza Velioglu
J. Rose
VLM
121
103
0
23 Dec 2020
Training data-efficient image transformers & distillation through attention
International Conference on Machine Learning (ICML), 2021
Hugo Touvron
Matthieu Cord
Matthijs Douze
Francisco Massa
Alexandre Sablayrolles
Edouard Grave
ViT
649
8,277
0
23 Dec 2020