VisualBERT: A Simple and Performant Baseline for Vision and Language

9 August 2019

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

Diagnosing Vision-and-Language Navigation: What Really Matters

North American Chapter of the Association for Computational Linguistics (NAACL), 2021

Qi Wu

233

30 Mar 2021

Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers

Computer Vision and Pattern Recognition (CVPR), 2021

Antoine Miech

Jean-Baptiste Alayrac

328

159

30 Mar 2021

Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

Computer Vision and Pattern Recognition (CVPR), 2021

347

134

30 Mar 2021

Self-supervised Image-text Pre-training With Mixed Data In Chest X-rays

Xiaosong Wang

Ziyue Xu

134

30 Mar 2021

Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers

IEEE International Conference on Computer Vision (ICCV), 2021

354

408

29 Mar 2021

Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

IEEE International Conference on Computer Vision (ICCV), 2021

Pengchuan Zhang

Xiyang Dai

Jianwei Yang

Bin Xiao

Lu Yuan

Lei Zhang

Jianfeng Gao

ViT

302

373

29 Mar 2021

HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

IEEE International Conference on Computer Vision (ICCV), 2021

339

165

28 Mar 2021

Generating and Evaluating Explanations of Attended and Error-Inducing Input Regions for VQA Models

Applied AI Letters (AA), 2021

153

26 Mar 2021

Multi-Modal Answer Validation for Knowledge-Based VQA

AAAI Conference on Artificial Intelligence (AAAI), 2021

Jialin Wu

Jiasen Lu

Ashish Sabharwal

Roozbeh Mottaghi

377

167

23 Mar 2021

Instance-level Image Retrieval using Reranking Transformers

IEEE International Conference on Computer Vision (ICCV), 2021

354

107

22 Mar 2021

MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation

IEEE International Conference on Robotics and Automation (ICRA), 2021

Zachary Seymour

Kowshik Thopalli

Niluthpol Chowdhury Mithun

146

21 Mar 2021

Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning

IEEE International Conference on Computer Vision (ICCV), 2021

Joao Henriques

Andrea Vedaldi

AI4TS

278

18 Mar 2021

Few-Shot Visual Grounding for Natural Human-Robot Interaction

Georgios Tziafas

S. Kasaei

195

17 Mar 2021

Multimodal End-to-End Sparse Model for Emotion Recognition

North American Chapter of the Association for Computational Linguistics (NAACL), 2021

230

100

17 Mar 2021

Predicting Opioid Use Disorder from Longitudinal Healthcare Data using Multi-stream Transformer

American Medical Informatics Association Annual Symposium (AMIA), 2021

200

16 Mar 2021

LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

North American Chapter of the Association for Computational Linguistics (NAACL), 2021

199

16 Mar 2021

A Survey on Multimodal Disinformation Detection

International Conference on Computational Linguistics (COLING), 2021

Firoj Alam

S. Cresci

Tanmoy Chakraborty

Fabrizio Silvestri

Dimiter Dimitrov

Giovanni Da San Martino

Shaden Shaar

Hamed Firooz

Preslav Nakov

257

116

13 Mar 2021

Unified Pre-training for Program Understanding and Generation

North American Chapter of the Association for Computational Linguistics (NAACL), 2021

417

851

10 Mar 2021

Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision

International Journal of Computer Vision (IJCV), 2021

Andrew Shin

Masato Ishii

T. Narihira

289

06 Mar 2021

Causal Attention for Vision-Language Tasks

Computer Vision and Pattern Recognition (CVPR), 2021

Jianfei Cai

224

193

05 Mar 2021

WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2021

529

390

02 Mar 2021

M6: A Chinese Multimodal Pretrainer

Rui Men

...

Yong Li

Jialin Li

Jingren Zhou

J. Tang

Hongxia Yang

VLM MoE

345

147

01 Mar 2021

Detecting Harmful Content On Online Platforms: What Platforms Need Vs. Where Research Efforts Go

ACM Computing Surveys (CSUR), 2021

Arnav Arora

Preslav Nakov

Momchil Hardalov

Sheikh Muhammad Sarwar

...

264

27 Feb 2021

UniT: Multimodal Multitask Learning with a Unified Transformer

IEEE International Conference on Computer Vision (ICCV), 2021

Ronghang Hu

Amanpreet Singh

ViT

358

343

22 Feb 2021

Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer

IEEE International Conference on Document Analysis and Recognition (ICDAR), 2021

356

184

18 Feb 2021

Hierarchical Similarity Learning for Language-based Product Image Retrieval

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021

154

18 Feb 2021

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

Computer Vision and Pattern Recognition (CVPR), 2021

1.1K

1,360

17 Feb 2021

LambdaNetworks: Modeling Long-Range Interactions Without Attention

International Conference on Learning Representations (ICLR), 2021

Irwan Bello

509

187

17 Feb 2021

Biomedical Question Answering: A Survey of Approaches and Challenges

ACM Computing Surveys (CSUR), 2021

Chuanqi Tan

Xiaozhong Liu

250

123

10 Feb 2021

Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021

131

09 Feb 2021

CSS-LM: A Contrastive Framework for Semi-supervised Fine-tuning of Pre-trained Language Models

IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2021

Yusheng Su

Xu Han

Yankai Lin

Zhengyan Zhang

Zhiyuan Liu

Peng Li

Jie Zhou

Maosong Sun

176

07 Feb 2021

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

International Conference on Machine Learning (ICML), 2021

547

2,107

05 Feb 2021

RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER

AAAI Conference on Artificial Intelligence (AAAI), 2021

159

172

05 Feb 2021

Inferring spatial relations from textual descriptions of images

Pattern Recognition (Pattern Recogn.), 2021

A. Elu

Gorka Azkune

Oier López de Lacalle

Ignacio Arganda-Carreras

Aitor Soroa Etxabe

Eneko Agirre

139

01 Feb 2021

Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

Transactions of the Association for Computational Linguistics (TACL), 2021

Lisa Anne Hendricks

John F. J. Mellor

R. Schneider

Jean-Baptiste Alayrac

Aida Nematzadeh

238

126

31 Jan 2021

An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games

Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021

138

31 Jan 2021

Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network

AAAI Conference on Artificial Intelligence (AAAI), 2021

Yingwei Pan

Tao Mei

157

27 Jan 2021

Adversarial Text-to-Image Synthesis: A Review

Neural Networks (NN), 2021

322

201

25 Jan 2021

Latent Variable Models for Visual Question Answering

Zixu Wang

Yishu Miao

Lucia Specia

237

16 Jan 2021

Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge

213

15 Jan 2021

Latent Alignment of Procedural Concepts in Multimodal Recipes

Hossein Rajaby Faghihi

Roshanak Mirzaee

Sudarshan Paliwal

Parisa Kordjamshidi

112

12 Jan 2021

MSD: Saliency-aware Knowledge Distillation for Multimodal Understanding

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

Xiang Ren

163

06 Jan 2021

Transformers in Vision: A Survey

ACM Computing Surveys (CSUR), 2021

Salman Khan

924

3,176

04 Jan 2021

VinVL: Revisiting Visual Representations in Vision-Language Models

Pengchuan Zhang

Xiujun Li

Xiaowei Hu

Jianwei Yang

Lei Zhang

Lijuan Wang

Yejin Choi

Jianfeng Gao

ObjD VLM

513

168

02 Jan 2021

VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words

Annual Meeting of the Association for Computational Linguistics (ACL), 2021

Xiaopeng Lu

Tiancheng Zhao

Kyusong Lee

268

01 Jan 2021

UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

Annual Meeting of the Association for Computational Linguistics (ACL), 2020

795

406

31 Dec 2020

OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts

Rui Yan

Jiwei Li

371

30 Dec 2020

Detecting Hate Speech in Multi-modal Memes

Abhishek Das

Japsimar Singh Wahi

Siyao Li

136

29 Dec 2020

Detecting Hate Speech in Memes Using Multimodal Deep Learning Approaches: Prize-winning solution to Hateful Memes Challenge

Riza Velioglu

J. Rose

VLM

121

103

23 Dec 2020

Training data-efficient image transformers & distillation through attention

International Conference on Machine Learning (ICML), 2020

Alexandre Sablayrolles

Edouard Grave

ViT

649

8,277

23 Dec 2020