ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1908.03557
  4. Cited By
VisualBERT: A Simple and Performant Baseline for Vision and Language

VisualBERT: A Simple and Performant Baseline for Vision and Language

9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
    VLM
ArXiv (abs)PDFHTML

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

50 / 1,257 papers shown
Title
Multimodal Text Style Transfer for Outdoor Vision-and-Language
  Navigation
Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation
Wanrong Zhu
Xinze Wang
Tsu-Jui Fu
An Yan
P. Narayana
Kazoo Sone
Sugato Basu
Wenjie Wang
323
38
0
01 Jul 2020
Modality-Agnostic Attention Fusion for visual search with text feedback
Modality-Agnostic Attention Fusion for visual search with text feedback
Eric Dodds
Jack Culpepper
Simão Herdade
Yang Zhang
K. Boakye
EgoV
233
84
0
30 Jun 2020
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through
  Scene Graph
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
Fei Yu
Jiji Tang
Weichong Yin
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
352
399
0
30 Jun 2020
Contrastive Learning for Weakly Supervised Phrase Grounding
Contrastive Learning for Weakly Supervised Phrase Grounding
Tanmay Gupta
Arash Vahdat
Gal Chechik
Xiaodong Yang
Jan Kautz
Derek Hoiem
ObjDSSL
264
157
0
17 Jun 2020
VirTex: Learning Visual Representations from Textual Annotations
VirTex: Learning Visual Representations from Textual AnnotationsComputer Vision and Pattern Recognition (CVPR), 2020
Karan Desai
Justin Johnson
SSLVLM
416
465
0
11 Jun 2020
Large-Scale Adversarial Training for Vision-and-Language Representation
  Learning
Large-Scale Adversarial Training for Vision-and-Language Representation LearningNeural Information Processing Systems (NeurIPS), 2020
Zhe Gan
Yen-Chun Chen
Linjie Li
Chen Zhu
Yu Cheng
Jingjing Liu
ObjDVLM
338
535
0
11 Jun 2020
TRIE: End-to-End Text Reading and Information Extraction for Document
  Understanding
TRIE: End-to-End Text Reading and Information Extraction for Document UnderstandingACM Multimedia (ACM MM), 2020
Peng Zhang
Yunlu Xu
Zhanzhan Cheng
Shiliang Pu
Jing Lu
Liang Qiao
Yi Niu
Leilei Gan
SyDa
243
108
0
27 May 2020
Adaptive Transformers for Learning Multimodal Representations
Adaptive Transformers for Learning Multimodal Representations
Prajjwal Bhargava
103
5
0
15 May 2020
Behind the Scene: Revealing the Secrets of Pre-trained
  Vision-and-Language Models
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
Jize Cao
Zhe Gan
Yu Cheng
Licheng Yu
Yen-Chun Chen
Jingjing Liu
VLM
248
137
0
15 May 2020
Cross-Modality Relevance for Reasoning on Language and Vision
Cross-Modality Relevance for Reasoning on Language and Vision
Chen Zheng
Quan Guo
Parisa Kordjamshidi
LRM
130
37
0
12 May 2020
The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes
The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes
Douwe Kiela
Hamed Firooz
Aravind Mohan
Vedanuj Goswami
Amanpreet Singh
Pratik Ringshia
Davide Testuggine
266
744
0
10 May 2020
MISA: Modality-Invariant and -Specific Representations for Multimodal
  Sentiment Analysis
MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis
Devamanyu Hazarika
Roger Zimmermann
Soujanya Poria
277
942
0
07 May 2020
Cross-media Structured Common Space for Multimedia Event Extraction
Cross-media Structured Common Space for Multimedia Event ExtractionAnnual Meeting of the Association for Computational Linguistics (ACL), 2020
Pengfei Yu
Alireza Zareian
Qi Zeng
Spencer Whitehead
Di Lu
Heng Ji
Shih-Fu Chang
151
115
0
05 May 2020
Visuo-Linguistic Question Answering (VLQA) Challenge
Visuo-Linguistic Question Answering (VLQA) Challenge
Shailaja Keyur Sampat
Yezhou Yang
Chitta Baral
CoGe
125
1
0
01 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation
  Pre-training
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-trainingConference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLMVLMOffRLAI4TS
597
535
0
01 May 2020
Improving Vision-and-Language Navigation with Image-Text Pairs from the
  Web
Improving Vision-and-Language Navigation with Image-Text Pairs from the WebEuropean Conference on Computer Vision (ECCV), 2020
Arjun Majumdar
Ayush Shrivastava
Stefan Lee
Peter Anderson
Devi Parikh
Dhruv Batra
LM&Ro
400
256
0
30 Apr 2020
VD-BERT: A Unified Vision and Dialog Transformer with BERT
VD-BERT: A Unified Vision and Dialog Transformer with BERTConference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Yue Wang
Shafiq Joty
Michael R. Lyu
Irwin King
Caiming Xiong
Guosheng Lin
299
106
0
28 Apr 2020
Deep Multimodal Neural Architecture Search
Deep Multimodal Neural Architecture SearchACM Multimedia (ACM MM), 2020
Zhou Yu
Yuhao Cui
Jun-chen Yu
Meng Wang
Dacheng Tao
Qi Tian
149
107
0
25 Apr 2020
Experience Grounds Language
Experience Grounds Language
Yonatan Bisk
Ari Holtzman
Jesse Thomason
Jacob Andreas
Yoshua Bengio
...
Angeliki Lazaridou
Jonathan May
Aleksandr Nisnevich
Nicolas Pinto
Joseph P. Turian
431
395
0
21 Apr 2020
Are we pretraining it right? Digging deeper into visio-linguistic
  pretraining
Are we pretraining it right? Digging deeper into visio-linguistic pretraining
Amanpreet Singh
Vedanuj Goswami
Devi Parikh
VLM
140
48
0
19 Apr 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Oscar: Object-Semantics Aligned Pre-training for Vision-Language TasksEuropean Conference on Computer Vision (ECCV), 2020
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
...
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
651
2,122
0
13 Apr 2020
Multimodal Categorization of Crisis Events in Social Media
Multimodal Categorization of Crisis Events in Social MediaComputer Vision and Pattern Recognition (CVPR), 2020
Mahdi Abavisani
Liwei Wu
Shengli Hu
Joel R. Tetreault
A. Jaimes
236
110
0
10 Apr 2020
Learning to Scale Multilingual Representations for Vision-Language Tasks
Learning to Scale Multilingual Representations for Vision-Language TasksEuropean Conference on Computer Vision (ECCV), 2020
Andrea Burns
Donghyun Kim
Derry Wijaya
Kate Saenko
Bryan A. Plummer
178
36
0
09 Apr 2020
Context-Aware Group Captioning via Self-Attention and Contrastive
  Features
Context-Aware Group Captioning via Self-Attention and Contrastive FeaturesComputer Vision and Pattern Recognition (CVPR), 2020
Zhuowan Li
Quan Hung Tran
Long Mai
Zhe Lin
Alan Yuille
VLM
159
50
0
07 Apr 2020
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal
  Transformers
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
Zhicheng Huang
Zhaoyang Zeng
Bei Liu
Dongmei Fu
Jianlong Fu
ViT
346
467
0
02 Apr 2020
Pre-trained Models for Natural Language Processing: A Survey
Pre-trained Models for Natural Language Processing: A SurveyScience China Technological Sciences (Sci China Technol Sci), 2020
Xipeng Qiu
Tianxiang Sun
Yige Xu
Yunfan Shao
Ning Dai
Xuanjing Huang
LM&MAVLM
917
1,602
0
18 Mar 2020
XGPT: Cross-modal Generative Pre-Training for Image Captioning
XGPT: Cross-modal Generative Pre-Training for Image CaptioningNatural Language Processing and Chinese Computing (NLPCC), 2020
Qiaolin Xia
Haoyang Huang
Nan Duan
Dongdong Zhang
Lei Ji
Zhifang Sui
Edward Cui
Taroon Bharti
Xin Liu
Ming Zhou
MLLMVLM
211
84
0
03 Mar 2020
What BERT Sees: Cross-Modal Transfer for Visual Question Generation
What BERT Sees: Cross-Modal Transfer for Visual Question Generation
Thomas Scialom
Patrick Bordes
Paul-Alexis Dray
Jacopo Staiano
Patrick Gallinari
187
7
0
25 Feb 2020
Measuring Social Biases in Grounded Vision and Language Embeddings
Measuring Social Biases in Grounded Vision and Language EmbeddingsNorth American Chapter of the Association for Computational Linguistics (NAACL), 2020
Candace Ross
Boris Katz
Andrei Barbu
277
69
0
20 Feb 2020
Robustness Verification for Transformers
Robustness Verification for TransformersInternational Conference on Learning Representations (ICLR), 2020
Zhouxing Shi
Huan Zhang
Kai-Wei Chang
Shiyu Huang
Cho-Jui Hsieh
AAML
182
123
0
16 Feb 2020
UniVL: A Unified Video and Language Pre-Training Model for Multimodal
  Understanding and Generation
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
Huaishao Luo
Lei Ji
Ding Wang
Haoyang Huang
Nan Duan
Tianrui Li
Jason Li
Xilin Chen
Ming Zhou
VLM
312
419
0
15 Feb 2020
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised
  Image-Text Data
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
Di Qi
Lin Su
Jianwei Song
Edward Cui
Taroon Bharti
Arun Sacheti
VLM
329
275
0
22 Jan 2020
In Defense of Grid Features for Visual Question Answering
In Defense of Grid Features for Visual Question AnsweringComputer Vision and Pattern Recognition (CVPR), 2020
Huaizu Jiang
Ishan Misra
Marcus Rohrbach
Erik Learned-Miller
Xinlei Chen
OODObjD
295
352
0
10 Jan 2020
Visual Question Answering on 360° Images
Visual Question Answering on 360° ImagesIEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2020
Shih-Han Chou
Wei-Lun Chao
Wei-Sheng Lai
Min Sun
Ming-Hsuan Yang
138
27
0
10 Jan 2020
All-in-One Image-Grounded Conversational Agents
All-in-One Image-Grounded Conversational Agents
Da Ju
Kurt Shuster
Y-Lan Boureau
Jason Weston
LLMAG
133
9
0
28 Dec 2019
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art
  Baseline
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art BaselineEuropean Conference on Computer Vision (ECCV), 2019
Vishvak Murahari
Dhruv Batra
Devi Parikh
Abhishek Das
VLM
292
119
0
05 Dec 2019
15 Keypoints Is All You Need
15 Keypoints Is All You NeedComputer Vision and Pattern Recognition (CVPR), 2019
Michael Snower
Asim Kadav
Farley Lai
H. Graf
VOT3DH
262
50
0
05 Dec 2019
12-in-1: Multi-Task Vision and Language Representation Learning
12-in-1: Multi-Task Vision and Language Representation LearningComputer Vision and Pattern Recognition (CVPR), 2019
Jiasen Lu
Vedanuj Goswami
Marcus Rohrbach
Devi Parikh
Stefan Lee
VLMObjD
279
499
0
05 Dec 2019
Efficient Attention Mechanism for Visual Dialog that can Handle All the
  Interactions between Multiple Inputs
Efficient Attention Mechanism for Visual Dialog that can Handle All the Interactions between Multiple Inputs
Van-Quang Nguyen
Masanori Suganuma
Takayuki Okatani
247
7
0
26 Nov 2019
Learning to Learn Words from Visual Scenes
Learning to Learn Words from Visual Scenes
Dídac Surís
Dave Epstein
Heng Ji
Shih-Fu Chang
Carl Vondrick
VLMCLIPSSLOffRL
129
4
0
25 Nov 2019
Iterative Answer Prediction with Pointer-Augmented Multimodal
  Transformers for TextVQA
Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQAComputer Vision and Pattern Recognition (CVPR), 2019
Ronghang Hu
Amanpreet Singh
Trevor Darrell
Marcus Rohrbach
289
222
0
14 Nov 2019
Attention on Abstract Visual Reasoning
Attention on Abstract Visual Reasoning
Lukas Hahne
Timo Lüddecke
Florentin Wörgötter
David Kappel
GNN
122
23
0
14 Nov 2019
Multimodal Intelligence: Representation Learning, Information Fusion,
  and Applications
Multimodal Intelligence: Representation Learning, Information Fusion, and ApplicationsIEEE Journal on Selected Topics in Signal Processing (JSTSP), 2019
Chao Zhang
Zichao Yang
Xiaodong He
Li Deng
HAIAI4TS
275
396
0
10 Nov 2019
The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded
  Conversational Agents
The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational AgentsAnnual Meeting of the Association for Computational Linguistics (ACL), 2019
Kurt Shuster
Da Ju
Stephen Roller
Emily Dinan
Y-Lan Boureau
Jason Weston
226
84
0
09 Nov 2019
Contextual Grounding of Natural Language Entities in Images
Contextual Grounding of Natural Language Entities in Images
Farley Lai
Ning Xie
Derek Doran
Asim Kadav
ObjD
99
6
0
05 Nov 2019
TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning
  Baselines
TAB-VCR: Tags and Attributes based Visual Commonsense Reasoning Baselines
Jingxiang Lin
Unnat Jain
Alex Schwing
LRMReLM
284
10
0
31 Oct 2019
Good, Better, Best: Textual Distractors Generation for Multiple-Choice
  Visual Question Answering via Reinforcement Learning
Good, Better, Best: Textual Distractors Generation for Multiple-Choice Visual Question Answering via Reinforcement Learning
Jiaying Lu
Xin Ye
Yi Ren
Yezhou Yang
181
10
0
21 Oct 2019
UNITER: UNiversal Image-TExt Representation Learning
UNITER: UNiversal Image-TExt Representation LearningEuropean Conference on Computer Vision (ECCV), 2019
Yen-Chun Chen
Linjie Li
Licheng Yu
Ahmed El Kholy
Faisal Ahmed
Zhe Gan
Yu Cheng
Jingjing Liu
VLMOT
321
463
0
25 Sep 2019
Unified Vision-Language Pre-Training for Image Captioning and VQA
Unified Vision-Language Pre-Training for Image Captioning and VQAAAAI Conference on Artificial Intelligence (AAAI), 2019
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLMVLM
601
1,005
0
24 Sep 2019
NLVR2 Visual Bias Analysis
NLVR2 Visual Bias Analysis
Alane Suhr
Yoav Artzi
68
18
0
23 Sep 2019
Previous
123...242526
Next