VisualBERT: A Simple and Performant Baseline for Vision and Language

9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
    VLM

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

50 / 1,260 papers shown
Productivity, Portability, Performance: Data-Centric Python
Yiheng Wang
Yao Zhang
Yanzhang Wang
Yan Wan
Jiao Wang
Zhongyuan Wu
Yuhao Yang
Bowen She
412
112
0
01 Jul 2021
GlyphCRM: Bidirectional Encoder Representation for Chinese Character with its Glyph
Yunxin Li
Yu Zhao
Baotian Hu
Qingcai Chen
Yang Xiang
Xiaolong Wang
Yuxin Ding
Lin Ma
121
8
0
01 Jul 2021
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
Jing Liu
Xinxin Zhu
Fei Liu
Longteng Guo
Zijia Zhao
...
Weining Wang
Hanqing Lu
Shiyu Zhou
Jiajun Zhang
Jinqiao Wang
299
41
0
01 Jul 2021
Attention Bottlenecks for Multimodal Fusion
Neural Information Processing Systems (NeurIPS), 2021
Arsha Nagrani
Shan Yang
Anurag Arnab
A. Jansen
Cordelia Schmid
Chen Sun
577
698
0
30 Jun 2021
Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training
Hongwei Xue
Yupan Huang
Bei Liu
Houwen Peng
Jianlong Fu
Houqiang Li
Jiebo Luo
406
93
0
25 Jun 2021
A Picture May Be Worth a Hundred Words for Visual Question Answering
Yusuke Hirota
Noa Garcia
Mayu Otani
Chenhui Chu
Yuta Nakashima
Ittetsu Taniguchi
Takao Onoye
ViT
145
4
0
25 Jun 2021
A Transformer-based Cross-modal Fusion Model with Adversarial Training for VQA Challenge 2021
Keda Lu
Bo Fang
Kuan-Yu Chen
ViT
96
2
0
24 Jun 2021
DocFormer: End-to-End Transformer for Document Understanding
IEEE International Conference on Computer Vision (ICCV), 2021
Srikar Appalaraju
Bhavan A. Jasani
Bhargava Urala Kota
Yusheng Xie
R. Manmatha
ViT
348
346
0
22 Jun 2021
AOMD: An Analogy-aware Approach to Offensive Meme Detection on Social Media
Information Processing & Management (IPM), 2021
Lanyu Shang
Yang Zhang
Yuheng Zha
Yingxi Chen
Christina Youn
Dong Wang
109
27
0
21 Jun 2021
Efficient Self-supervised Vision Transformers for Representation Learning
International Conference on Learning Representations (ICLR), 2021
Chunyuan Li
Jianwei Yang
Pengchuan Zhang
Mei Gao
Bin Xiao
Xiyang Dai
Lu Yuan
Jianfeng Gao
ViT
302
222
0
17 Jun 2021
Probing Image-Language Transformers for Verb Understanding
Lisa Anne Hendricks
Aida Nematzadeh
214
131
0
16 Jun 2021
Pre-Trained Models: Past, Present and Future
AI Open (AO), 2021
Xu Han
Zhengyan Zhang
Ning Ding
Yuxian Gu
Xiao Liu
...
Jie Tang
Ji-Rong Wen
Jinhui Yuan
Wayne Xin Zhao
Jun Zhu
AIFin MQ AI4MH
385
990
0
14 Jun 2021
Deciphering Implicit Hate: Evaluating Automated Detection Algorithms for Multimodal Hate
Findings (Findings), 2021
Austin Botelho
Bertie Vidgen
Scott A. Hale
99
14
0
10 Jun 2021
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Neural Information Processing Systems (NeurIPS), 2021
Mandela Patrick
Dylan Campbell
Yuki M. Asano
Ishan Misra
Florian Metze
Christoph Feichtenhofer
Andrea Vedaldi
João F. Henriques
283
340
0
09 Jun 2021
Check It Again: Progressive Visual Question Answering via Visual Entailment
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Q. Si
Zheng Lin
Mingyu Zheng
Peng Fu
Weiping Wang
151
52
0
08 Jun 2021
A Survey of Transformers
AI Open (AO), 2021
Tianyang Lin
Yuxin Wang
Xiangyang Liu
Xipeng Qiu
ViT
445
1,386
0
08 Jun 2021
Chasing Sparsity in Vision Transformers: An End-to-End Exploration
Neural Information Processing Systems (NeurIPS), 2021
Tianlong Chen
Yu Cheng
Zhe Gan
Lu Yuan
Lei Zhang
Zinan Lin
ViT
254
255
0
08 Jun 2021
Are VQA Systems RAD? Measuring Robustness to Augmented Data with Focused Interventions
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Daniel Rosenberg
Itai Gat
Amir Feder
Roi Reichart
AAML
276
16
0
08 Jun 2021
MERLOT: Multimodal Neural Script Knowledge Models
Neural Information Processing Systems (NeurIPS), 2021
Rowan Zellers
Ximing Lu
Jack Hessel
Youngjae Yu
J. S. Park
Jize Cao
Ali Farhadi
Yejin Choi
VLM LRM
348
428
0
04 Jun 2021
Human-Adversarial Visual Question Answering
Neural Information Processing Systems (NeurIPS), 2021
Sasha Sheng
Amanpreet Singh
Vedanuj Goswami
Jose Alberto Lopez Magana
Wojciech Galuba
Devi Parikh
Douwe Kiela
OOD EgoV AAML
122
69
0
04 Jun 2021
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Haiyang Xu
Ming Yan
Chenliang Li
Bin Bi
Songfang Huang
Wenming Xiao
Fei Huang
VLM
316
127
0
03 Jun 2021
Volta at SemEval-2021 Task 6: Towards Detecting Persuasive Texts and Images using Textual and Multimodal Ensemble
International Workshop on Semantic Evaluation (SemEval), 2021
Kshitij Gupta
Devansh Gautam
R. Mamidi
101
15
0
01 Jun 2021
Modeling Text-visual Mutual Dependency for Multi-modal Dialog Generation
Shuhe Wang
Yuxian Meng
Xiaofei Sun
Leilei Gan
Rongbin Ouyang
Rui Yan
Tianwei Zhang
Jiwei Li
220
15
0
30 May 2021
Rethinking the constraints of multimodal fusion: case study in Weakly-Supervised Audio-Visual Video Parsing
Jianning Wu
Zhuqing Jiang
S. Wen
Aidong Men
Haiying Wang
223
1
0
30 May 2021
Enhance Multimodal Model Performance with Data Augmentation: Facebook Hateful Meme Challenge Solution
Yang Li
Zi-xin Zhang
Hutchin Huang
173
1
0
25 May 2021
Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
IEEE journal of biomedical and health informatics (JBHI), 2021
Jong Hak Moon
HyunGyung Lee
W. Shin
Young-Hak Kim
Edward Choi
MedIm
226
211
0
24 May 2021
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
Findings (Findings), 2021
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Prahal Arora
Masoumeh Aminzadeh
Christoph Feichtenhofer
Florian Metze
Luke Zettlemoyer
327
146
0
20 May 2021
Recent Advances in Deep Learning Based Dialogue Systems: A Systematic Survey
Artificial Intelligence Review (AIR), 2021
Jinjie Ni
Tom Young
Vlad Pandelea
Fuzhao Xue
Xiaoshi Zhong
831
322
0
10 May 2021
Chop Chop BERT: Visual Question Answering by Chopping VisualBERT's Heads
International Joint Conference on Artificial Intelligence (IJCAI), 2021
Chenyu Gao
Qi Zhu
Peng Wang
Qi Wu
105
2
0
30 Apr 2021
Multimodal Contrastive Training for Visual Representation Learning
Computer Vision and Pattern Recognition (CVPR), 2021
Xin Yuan
Zhe Lin
Jason Kuen
Jianming Zhang
Yilin Wang
Michael Maire
Ajinkya Kale
Baldo Faieta
SSL
240
191
0
26 Apr 2021
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
IEEE International Conference on Computer Vision (ICCV), 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD VLM
637
1,055
0
26 Apr 2021
InfographicVQA
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2021
Minesh Mathew
Viraj Bagal
Rubèn Pérez Tito
Dimosthenis Karatzas
Ernest Valveny
C. V. Jawahar
378
370
0
26 Apr 2021
SemEval-2021 Task 6: Detection of Persuasion Techniques in Texts and Images
International Workshop on Semantic Evaluation (SemEval), 2021
Dimitar Dimitrov
Bishr Bin Ali
Shaden Shaar
Firoj Alam
Fabrizio Silvestri
Hamed Firooz
Preslav Nakov
Giovanni Da San Martino
147
120
0
25 Apr 2021
MusCaps: Generating Captions for Music Audio
IEEE International Joint Conference on Neural Network (IJCNN), 2021
Ilaria Manco
Emmanouil Benetos
Elio Quinton
Gyorgy Fazekas
284
43
0
24 Apr 2021
Playing Lottery Tickets with Vision and Language
AAAI Conference on Artificial Intelligence (AAAI), 2021
Zhe Gan
Yen-Chun Chen
Linjie Li
Tianlong Chen
Yu Cheng
Shuohang Wang
Jingjing Liu
Lijuan Wang
Zicheng Liu
VLM
303
62
0
23 Apr 2021
Multiscale Vision Transformers
IEEE International Conference on Computer Vision (ICCV), 2021
Haoqi Fan
Bo Xiong
K. Mangalam
Yanghao Li
Zhicheng Yan
Jitendra Malik
Christoph Feichtenhofer
ViT
481
1,513
0
22 Apr 2021
Detector-Free Weakly Supervised Grounding by Separation
IEEE International Conference on Computer Vision (ICCV), 2021
Assaf Arbelle
Sivan Doveh
Amit Alfassy
J. Shtok
Guy Lev
...
Kate Saenko
S. Ullman
Raja Giryes
Rogerio Feris
Leonid Karlinsky
186
31
0
20 Apr 2021
BM-NAS: Bilevel Multimodal Neural Architecture Search
AAAI Conference on Artificial Intelligence (AAAI), 2021
Yihang Yin
Siyu Huang
Xiang Zhang
232
34
0
19 Apr 2021
LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
Yiheng Xu
Tengchao Lv
Lei Cui
Guoxin Wang
Yijuan Lu
D. Florêncio
Cha Zhang
Furu Wei
MLLM VLM
261
167
0
18 Apr 2021
LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding
Te-Lin Wu
Cheng-rong Li
Mingyang Zhang
Tao Chen
Spurthi Amba Hombaiah
Michael Bendersky
147
15
0
16 Apr 2021
Cross-Modal Retrieval Augmentation for Multi-Modal Classification
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Shir Gur
Natalia Neverova
C. Stauffer
Ser-Nam Lim
Douwe Kiela
A. Reiter
217
36
0
16 Apr 2021
Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Taichi Iki
Akiko Aizawa
VLM
234
21
0
16 Apr 2021
NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Grace Luo
Trevor Darrell
Anna Rohrbach
247
127
0
13 Apr 2021
Non-autoregressive Transformer-based End-to-end ASR using BERT
IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2021
Fu-Hao Yu
Kuan-Yu Chen
141
32
0
10 Apr 2021
How Transferable are Reasoning Patterns in VQA?
Computer Vision and Pattern Recognition (CVPR), 2021
Corentin Kervadec
Theo Jaunet
G. Antipov
M. Baccouche
Romain Vuillemot
Christian Wolf
LRM
149
29
0
08 Apr 2021
Multimodal Fusion Refiner Networks
Sethuraman Sankaran
David Yang
Ser-Nam Lim
OffRL
172
8
0
08 Apr 2021
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
Computer Vision and Pattern Recognition (CVPR), 2021
Zhicheng Huang
Zhaoyang Zeng
Yupan Huang
Bei Liu
Dongmei Fu
Jianlong Fu
VLM ViT
425
303
0
07 Apr 2021
Towards General Purpose Vision Systems
Computer Vision and Pattern Recognition (CVPR), 2021
Tanmay Gupta
Amita Kamath
Aniruddha Kembhavi
Derek Hoiem
275
55
0
01 Apr 2021
Zero-Shot Language Transfer vs Iterative Back Translation for Unsupervised Machine Translation
Aviral Joshi
Chengzhi Huang
H. Singh
157
2
0
31 Mar 2021
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
IEEE International Conference on Computer Vision (ICCV), 2021
Or Patashnik
Zongze Wu
Eli Shechtman
Daniel Cohen-Or
Dani Lischinski
CLIP VLM
390
1,369
0
31 Mar 2021