VisualBERT: A Simple and Performant Baseline for Vision and Language

9 August 2019

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

A Multimodal Framework for the Detection of Hateful Memes

23 Dec 2020

A Survey on Visual Transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020

...

23 Dec 2020

Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks

22 Dec 2020

ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces. AAAI Conference on Artificial Intelligence (AAAI), 2020

Blaise Agüera y Arcas

22 Dec 2020

KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA. Computer Vision and Pattern Recognition (CVPR), 2020

Devi Parikh

20 Dec 2020

MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification. AAAI Conference on Artificial Intelligence (AAAI), 2020

16 Dec 2020

A Closer Look at the Robustness of Vision-and-Language Pre-trained Models

15 Dec 2020

Attention over learned object embeddings enables complex visual reasoning. Neural Information Processing Systems (NeurIPS), 2020

15 Dec 2020

Vilio: State-of-the-art Visio-Linguistic Models applied to Hateful Memes

Niklas Muennighoff

14 Dec 2020

KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning. Knowledge-Based Systems (KBS), 2020

Dandan Song

13 Dec 2020

MiniVLM: A Smaller and Faster Vision-Language Model

Xiaowei Hu

Zicheng Liu

13 Dec 2020

Hateful Memes Detection via Complementary Visual and Linguistic Networks

09 Dec 2020

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

Lei Zhang

08 Dec 2020

Parameter Efficient Multimodal Transformers for Video Representation Learning

08 Dec 2020

Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation

Yejin Choi

08 Dec 2020

Neurosymbolic AI for Situated Language Understanding

Nikhil Krishnaswamy

James Pustejovsky

05 Dec 2020

Classification of Multimodal Hate Speech -- The Winning Solution of Hateful Memes Challenge

Xiayu Zhong

02 Dec 2020

Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs. Transactions of the Association for Computational Linguistics (TACL), 2020

30 Nov 2020

Learning from Lexical Perturbations for Consistent Visual Question Answering

Heng Ji

26 Nov 2020

A Recurrent Vision-and-Language BERT for Navigation. Computer Vision and Pattern Recognition (CVPR), 2020

Yicong Hong

Qi Wu

Yuankai Qi

Cristian Rodriguez-Opazo

Stephen Gould

26 Nov 2020

Adversarial Evaluation of Multimodal Models under Realistic Gray Box Assumption

Cristian Canton Ferrer

25 Nov 2020

Multimodal Learning for Hateful Memes Detection

Yi Zhou

Zhenhao Chen

25 Nov 2020

Open-Vocabulary Object Detection Using Captions. Computer Vision and Pattern Recognition (CVPR), 2020

Derek Hao Hu

20 Nov 2020

Improving Calibration in Deep Metric Learning With Cross-Example Softmax

Andreas Veit

Kimberly Wilber

17 Nov 2020

Transductive Zero-Shot Learning using Cross-Modal CycleGAN

13 Nov 2020

Human-centric Spatio-Temporal Video Grounding With Visual Transformers

Zongheng Tang

10 Nov 2020

Multi-document Summarization via Deep Learning Techniques: A Survey

Hu Wang

10 Nov 2020

CapWAP: Captioning with a Purpose

09 Nov 2020

Utilizing Every Image Object for Semi-supervised Phrase Grounding

05 Nov 2020

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning. Neural Information Processing Systems (NeurIPS), 2020

Simon Ging

Mohammadreza Zolfaghari

Hamed Pirsiavash

Thomas Brox

01 Nov 2020

MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering. Findings (Findings), 2020

27 Oct 2020

Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions

24 Oct 2020

Can images help recognize entities? A study of the role of images for Multimodal NER

23 Oct 2020

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy

...

22 Oct 2020

Multimodal Research in Vision and Language: A Review of Current and Emerging Trends

Roger Zimmermann

19 Oct 2020

Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering. International Conference on Pattern Recognition (ICPR), 2020

17 Oct 2020

Unsupervised Natural Language Inference via Decoupled Multimodal Contrastive Learning. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

16 Oct 2020

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

Hao Tan

Joey Tianyi Zhou

14 Oct 2020

CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations

Fuli Luo

Pengcheng Yang

Shicheng Li

Xuancheng Ren

Xu Sun

13 Oct 2020

MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding

12 Oct 2020

Beyond Language: Learning Commonsense from Images for Reasoning. Findings (Findings), 2020

Liang Pang

10 Oct 2020

Learning to Represent Image and Text with Denotation Graph

06 Oct 2020

Support-set bottlenecks for video-text representation learning

Mandela Patrick

Po-Yao (Bernie) Huang

Yuki M. Asano

Florian Metze

Alexander G. Hauptmann

João Henriques

Andrea Vedaldi

06 Oct 2020

Pathological Visual Question Answering

06 Oct 2020

Multi-Modal Open-Domain Dialogue. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

Jason Weston

02 Oct 2020

A Multimodal Memes Classification: A Survey and Open Research Issues

17 Sep 2020

Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations

Meng-Jiun Chiou

Roger Zimmermann

Jiashi Feng

10 Sep 2020

A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports. IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2020

Yikuan Li

Hanyin Wang

Yuan Luo

03 Sep 2020

Active Contrastive Learning of Audio-Visual Video Representations

31 Aug 2020

DeVLBert: Learning Deconfounded Visio-Linguistic Representations

Zhou Zhao

Hongxia Yang

16 Aug 2020