VisualBERT: A Simple and Performant Baseline for Vision and Language

9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
    VLM

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

50 / 1,260 papers shown
Diagnosing Vision-and-Language Navigation: What Really Matters
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Wanrong Zhu
Yuankai Qi
P. Narayana
Kazoo Sone
Sugato Basu
Xinze Wang
Qi Wu
Miguel P. Eckstein
Wenjie Wang
LM&Ro
233
55
0
30 Mar 2021
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
Computer Vision and Pattern Recognition (CVPR), 2021
Antoine Miech
Jean-Baptiste Alayrac
Ivan Laptev
Josef Sivic
Andrew Zisserman
ViT
328
159
0
30 Mar 2021
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
Computer Vision and Pattern Recognition (CVPR), 2021
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Linbo Jin
Ben Chen
Hao Zhou
Minghui Qiu
Ling Shao
VLM
347
134
0
30 Mar 2021
Self-supervised Image-text Pre-training With Mixed Data In Chest X-rays
Xiaosong Wang
Ziyue Xu
Leo K. Tam
Dong Yang
Daguang Xu
ViT
MedIm
134
25
0
30 Mar 2021
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers
IEEE International Conference on Computer Vision (ICCV), 2021
Hila Chefer
Shir Gur
Lior Wolf
ViT
354
408
0
29 Mar 2021
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
IEEE International Conference on Computer Vision (ICCV), 2021
Pengchuan Zhang
Xiyang Dai
Jianwei Yang
Bin Xiao
Lu Yuan
Lei Zhang
Jianfeng Gao
ViT
302
373
0
29 Mar 2021
HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
IEEE International Conference on Computer Vision (ICCV), 2021
Song Liu
Haoqi Fan
Shengsheng Qian
Yiru Chen
Wenkui Ding
Zhongyuan Wang
339
165
0
28 Mar 2021
Generating and Evaluating Explanations of Attended and Error-Inducing Input Regions for VQA Models
Applied AI Letters (AA), 2021
Arijit Ray
Michael Cogswell
Xiaoyu Lin
Kamran Alipour
Ajay Divakaran
Yi Yao
Giedrius Burachas
FAtt
153
5
0
26 Mar 2021
Multi-Modal Answer Validation for Knowledge-Based VQA
AAAI Conference on Artificial Intelligence (AAAI), 2021
Jialin Wu
Jiasen Lu
Ashish Sabharwal
Roozbeh Mottaghi
377
167
0
23 Mar 2021
Instance-level Image Retrieval using Reranking Transformers
IEEE International Conference on Computer Vision (ICCV), 2021
Fuwen Tan
Jiangbo Yuan
Vicente Ordonez
ViT
354
107
0
22 Mar 2021
MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation
IEEE International Conference on Robotics and Automation (ICRA), 2021
Zachary Seymour
Kowshik Thopalli
Niluthpol Chowdhury Mithun
Han-Pang Chiu
S. Samarasekera
Rakesh Kumar
3DPC
146
20
0
21 Mar 2021
Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning
IEEE International Conference on Computer Vision (ICCV), 2021
Mandela Patrick
Yuki M. Asano
Bernie Huang
Ishan Misra
Florian Metze
Joao Henriques
Andrea Vedaldi
AI4TS
278
36
0
18 Mar 2021
Few-Shot Visual Grounding for Natural Human-Robot Interaction
Georgios Tziafas
S. Kasaei
195
7
0
17 Mar 2021
Multimodal End-to-End Sparse Model for Emotion Recognition
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Wenliang Dai
Samuel Cahyawijaya
Zihan Liu
Pascale Fung
CVBM
230
100
0
17 Mar 2021
Predicting Opioid Use Disorder from Longitudinal Healthcare Data using Multi-stream Transformer
American Medical Informatics Association Annual Symposium (AMIA), 2021
S. Fouladvand
J. Talbert
L. Dwoskin
H. Bush
A. Meadows
Lars E. Peterson
Ramakanth Kavuluru
Jin Chen
200
5
0
16 Mar 2021
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Siqi Sun
Yen-Chun Chen
Linjie Li
Shuohang Wang
Yuwei Fang
Jingjing Liu
VLM
199
89
0
16 Mar 2021
A Survey on Multimodal Disinformation Detection
International Conference on Computational Linguistics (COLING), 2021
Firoj Alam
S. Cresci
Tanmoy Chakraborty
Fabrizio Silvestri
Dimiter Dimitrov
Giovanni Da San Martino
Shaden Shaar
Hamed Firooz
Preslav Nakov
257
116
0
13 Mar 2021
Unified Pre-training for Program Understanding and Generation
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Wasi Uddin Ahmad
Saikat Chakraborty
Baishakhi Ray
Kai-Wei Chang
417
851
0
10 Mar 2021
Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision
International Journal of Computer Vision (IJCV), 2021
Andrew Shin
Masato Ishii
T. Narihira
289
49
0
06 Mar 2021
Causal Attention for Vision-Language Tasks
Computer Vision and Pattern Recognition (CVPR), 2021
Xu Yang
Hanwang Zhang
Guojun Qi
Jianfei Cai
CML
224
193
0
05 Mar 2021
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2021
Krishna Srinivasan
K. Raman
Jiecao Chen
Michael Bendersky
Marc Najork
VLM
529
390
0
02 Mar 2021
M6: A Chinese Multimodal Pretrainer
Junyang Lin
Rui Men
An Yang
Chan Zhou
Ming Ding
...
Yong Li
Jialin Li
Jingren Zhou
J. Tang
Hongxia Yang
VLM
MoE
345
147
0
01 Mar 2021
Detecting Harmful Content On Online Platforms: What Platforms Need Vs. Where Research Efforts Go
ACM Computing Surveys (CSUR), 2021
Arnav Arora
Preslav Nakov
Momchil Hardalov
Sheikh Muhammad Sarwar
Vibha Nayak
...
Dimitrina Zlatkova
Kyle Dent
Ameya Bhatawdekar
Guillaume Bouchard
Isabelle Augenstein
264
71
0
27 Feb 2021
UniT: Multimodal Multitask Learning with a Unified Transformer
IEEE International Conference on Computer Vision (ICCV), 2021
Ronghang Hu
Amanpreet Singh
ViT
358
343
0
22 Feb 2021
Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer
IEEE International Conference on Document Analysis and Recognition (ICDAR), 2021
Rafal Powalski
Łukasz Borchmann
Dawid Jurkiewicz
Tomasz Dwojak
Michal Pietruszka
Gabriela Pałka
ViT
356
184
0
18 Feb 2021
Hierarchical Similarity Learning for Language-based Product Image Retrieval
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021
Zhe Ma
Fenghao Liu
Jianfeng Dong
Xiaoye Qu
Yuan He
S. Ji
VLM
154
7
0
18 Feb 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Computer Vision and Pattern Recognition (CVPR), 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
1.1K
1,360
0
17 Feb 2021
LambdaNetworks: Modeling Long-Range Interactions Without Attention
International Conference on Learning Representations (ICLR), 2021
Irwan Bello
509
187
0
17 Feb 2021
Biomedical Question Answering: A Survey of Approaches and Challenges
ACM Computing Surveys (CSUR), 2021
Qiao Jin
Zheng Yuan
Guangzhi Xiong
Qian Yu
Huaiyuan Ying
Chuanqi Tan
Mosha Chen
Songfang Huang
Xiaozhong Liu
Sheng Yu
250
123
0
10 Feb 2021
Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
Linwei Ye
Mrigank Rochan
Zhi Liu
Xiaoqin Zhang
Yang Wang
VOS
EgoV
131
65
0
09 Feb 2021
CSS-LM: A Contrastive Framework for Semi-supervised Fine-tuning of Pre-trained Language Models
IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2021
Yusheng Su
Xu Han
Yankai Lin
Zhengyan Zhang
Zhiyuan Liu
Peng Li
Jie Zhou
Maosong Sun
176
12
0
07 Feb 2021
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
International Conference on Machine Learning (ICML), 2021
Wonjae Kim
Bokyung Son
Ildoo Kim
VLM
CLIP
547
2,107
0
05 Feb 2021
RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER
AAAI Conference on Artificial Intelligence (AAAI), 2021
Lin Sun
Jiquan Wang
Kai Zhang
Yindu Su
Fangsheng Weng
159
172
0
05 Feb 2021
Inferring spatial relations from textual descriptions of images
Pattern Recognition (Pattern Recogn.), 2021
A. Elu
Gorka Azkune
Oier López de Lacalle
Ignacio Arganda-Carreras
Aitor Soroa Etxabe
Eneko Agirre
139
2
0
01 Feb 2021
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Transactions of the Association for Computational Linguistics (TACL), 2021
Lisa Anne Hendricks
John F. J. Mellor
R. Schneider
Jean-Baptiste Alayrac
Aida Nematzadeh
238
126
0
31 Jan 2021
An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021
Alessandro Suglia
Yonatan Bisk
Ioannis Konstas
Antonio Vergari
E. Bastianelli
Andrea Vanzo
Oliver Lemon
138
8
0
31 Jan 2021
Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
AAAI Conference on Artificial Intelligence (AAAI), 2021
Yehao Li
Yingwei Pan
Ting Yao
Jingwen Chen
Tao Mei
VLM
157
58
0
27 Jan 2021
Adversarial Text-to-Image Synthesis: A Review
Neural Networks (NN), 2021
Stanislav Frolov
Tobias Hinz
Federico Raue
Jörn Hees
Andreas Dengel
EGVM
322
201
0
25 Jan 2021
Latent Variable Models for Visual Question Answering
Zixu Wang
Yishu Miao
Lucia Specia
237
5
0
16 Jan 2021
Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge
Violetta Shevchenko
Damien Teney
A. Dick
Anton Van Den Hengel
213
31
0
15 Jan 2021
Latent Alignment of Procedural Concepts in Multimodal Recipes
Hossein Rajaby Faghihi
Roshanak Mirzaee
Sudarshan Paliwal
Parisa Kordjamshidi
112
3
0
12 Jan 2021
MSD: Saliency-aware Knowledge Distillation for Multimodal Understanding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Woojeong Jin
Maziar Sanjabi
Shaoliang Nie
L Tan
Xiang Ren
Hamed Firooz
163
6
0
06 Jan 2021
Transformers in Vision: A Survey
ACM Computing Surveys (CSUR), 2021
Salman Khan
Muzammal Naseer
Munawar Hayat
Syed Waqas Zamir
Fahad Shahbaz Khan
M. Shah
ViT
924
3,176
0
04 Jan 2021
VinVL: Revisiting Visual Representations in Vision-Language Models
Pengchuan Zhang
Xiujun Li
Xiaowei Hu
Jianwei Yang
Lei Zhang
Lijuan Wang
Yejin Choi
Jianfeng Gao
ObjD
VLM
513
168
0
02 Jan 2021
VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Xiaopeng Lu
Tiancheng Zhao
Kyusong Lee
268
29
0
01 Jan 2021
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
795
406
0
31 Dec 2020
OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts
Yuxian Meng
Shuhe Wang
Qinghong Han
Xiaofei Sun
Leilei Gan
Rui Yan
Jiwei Li
371
31
0
30 Dec 2020
Detecting Hate Speech in Multi-modal Memes
Abhishek Das
Japsimar Singh Wahi
Siyao Li
136
75
0
29 Dec 2020
Detecting Hate Speech in Memes Using Multimodal Deep Learning Approaches: Prize-winning solution to Hateful Memes Challenge
Riza Velioglu
J. Rose
VLM
121
103
0
23 Dec 2020
Training data-efficient image transformers & distillation through attention
International Conference on Machine Learning (ICML), 2020
Hugo Touvron
Matthieu Cord
Matthijs Douze
Francisco Massa
Alexandre Sablayrolles
Edouard Grave
ViT
649
8,277
0
23 Dec 2020