ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Neural Information Processing Systems (NeurIPS), 2019
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL, VLM
arXiv:1908.02265

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,232 papers shown
Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models
IEEE International Conference on Computer Vision (ICCV), 2021
Linjie Li
Jie Lei
Zhe Gan
Jingjing Liu
AAML, VLM
308
92
0
01 Jun 2021
M6-T: Exploring Sparse Expert Models and Beyond
An Yang
Junyang Lin
Rui Men
Chang Zhou
Le Jiang
...
Dingyang Zhang
Jialin Li
Lin Qu
Jingren Zhou
Hongxia Yang
MoE
368
24
0
31 May 2021
Dual-stream Network for Visual Recognition
Neural Information Processing Systems (NeurIPS), 2021
Mingyuan Mao
Renrui Zhang
Honghui Zheng
Shiyang Feng
Teli Ma
Yan Peng
Errui Ding
Baochang Zhang
Shumin Han
ViT
278
78
0
31 May 2021
GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning
Findings (Findings), 2021
Jiaqi Chen
Jianheng Tang
Jinghui Qin
Xiaodan Liang
Lingbo Liu
Eric Xing
Liang Lin
AIMat
224
251
0
30 May 2021
Modeling Text-visual Mutual Dependency for Multi-modal Dialog Generation
Shuhe Wang
Yuxian Meng
Xiaofei Sun
Leilei Gan
Rongbin Ouyang
Rui Yan
Tianwei Zhang
Jiwei Li
224
15
0
30 May 2021
M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis via Non-Autoregressive Generative Transformers
Zhu Zhang
Jianxin Ma
Chang Zhou
Rui Men
Zhikang Li
Ming Ding
Jie Tang
Jingren Zhou
Hongxia Yang
352
47
0
29 May 2021
Maintaining Common Ground in Dynamic Environments
Transactions of the Association for Computational Linguistics (TACL), 2021
Takuma Udagawa
Akiko Aizawa
167
15
0
29 May 2021
Learning Relation Alignment for Calibrated Cross-modal Retrieval
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Shuhuai Ren
Junyang Lin
Guangxiang Zhao
Rui Men
An Yang
Jingren Zhou
Xu Sun
Hongxia Yang
212
39
0
28 May 2021
Maria: A Visual Experience Powered Conversational Agent
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Zujie Liang
Huang Hu
Can Xu
Chongyang Tao
Xiubo Geng
Yining Chen
Fan Liang
Daxin Jiang
202
33
0
27 May 2021
Multi-Modal Semantic Inconsistency Detection in Social Media News Posts
Conference on Multimedia Modeling (MMM), 2021
S. McCrae
Kehan Wang
A. Zakhor
147
16
0
26 May 2021
Understanding Mobile GUI: from Pixel-Words to Screen-Sentences
Jingwen Fu
Xiaoyi Zhang
Yuwang Wang
Wenjun Zeng
Sam Yang
Grayson Hilliard
234
17
0
25 May 2021
Enhance Multimodal Model Performance with Data Augmentation: Facebook Hateful Meme Challenge Solution
Yang Li
Zi-xin Zhang
Hutchin Huang
173
1
0
25 May 2021
Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation
Computer Vision and Pattern Recognition (CVPR), 2021
Tao Tu
Q. Ping
Govind Thattai
Gokhan Tur
Premkumar Natarajan
185
18
0
24 May 2021
Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
IEEE Journal of Biomedical and Health Informatics (JBHI), 2021
Jong Hak Moon
HyunGyung Lee
W. Shin
Young-Hak Kim
Edward Choi
MedIm
235
211
0
24 May 2021
Human-centric Relation Segmentation: Dataset and Solution
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
Si Liu
Zitian Wang
Yulu Gao
Lejian Ren
Yue Liao
Guanghui Ren
Bo Li
Shuicheng Yan
200
13
0
24 May 2021
Aligning Visual Prototypes with BERT Embeddings for Few-Shot Learning
International Conference on Multimedia Retrieval (ICMR), 2021
Kun Yan
Zied Bouraoui
Ping Wang
Shoaib Jameel
Steven Schockaert
141
32
0
21 May 2021
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
Findings (Findings), 2021
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Prahal Arora
Masoumeh Aminzadeh
Christoph Feichtenhofer
Florian Metze
Luke Zettlemoyer
327
146
0
20 May 2021
Pathdreamer: A World Model for Indoor Navigation
Jing Yu Koh
Honglak Lee
Yinfei Yang
Jason Baldridge
Peter Anderson
354
114
0
18 May 2021
Parallel Attention Network with Sequence Matching for Video Grounding
Findings (Findings), 2021
Hao Zhang
Aixin Sun
Wei Jing
Liangli Zhen
Qiufeng Wang
Rick Siow Mong Goh
268
50
0
18 May 2021
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions
Computer Vision and Pattern Recognition (CVPR), 2021
Junbin Xiao
Xindi Shang
Angela Yao
Tat-Seng Chua
392
721
0
18 May 2021
A Review on Explainability in Multimodal Deep Neural Nets
IEEE Access, 2021
Gargi Joshi
Rahee Walambe
K. Kotecha
381
171
0
17 May 2021
Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval
International Conference on Machine Learning and Applications (ICMLA), 2021
K. Ueki
257
5
0
16 May 2021
Episodic Transformer for Vision-and-Language Navigation
IEEE International Conference on Computer Vision (ICCV), 2021
Alexander Pashevich
Cordelia Schmid
Chen Sun
LM&Ro
346
212
0
13 May 2021
Video Corpus Moment Retrieval with Contrastive Learning
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2021
Hao Zhang
Aixin Sun
Wei Jing
Guoshun Nan
Liangli Zhen
Qiufeng Wang
Rick Siow Mong Goh
274
102
0
13 May 2021
Connecting What to Say With Where to Look by Modeling Human Attention Traces
Computer Vision and Pattern Recognition (CVPR), 2021
Zihang Meng
Licheng Yu
Ning Zhang
Tamara L. Berg
Babak Damavandi
Vikas Singh
Amy Bearman
262
31
0
12 May 2021
VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language Matching
Chenchi Zhang
Wenbo Ma
Jun Xiao
Hanwang Zhang
Jian Shao
Yueting Zhuang
Long Chen
289
5
0
12 May 2021
Language Acquisition is Embodied, Interactive, Emotive: a Research Proposal
C. Kennington
LM&Ro
106
0
0
10 May 2021
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
Computer Vision and Pattern Recognition (CVPR), 2021
Mathew Monfort
SouYoung Jin
Alexander H. Liu
David Harwath
Rogerio Feris
James Glass
Aude Oliva
181
68
0
10 May 2021
Recent Advances in Deep Learning Based Dialogue Systems: A Systematic Survey
Artificial Intelligence Review (AIR), 2021
Jinjie Ni
Tom Young
Vlad Pandelea
Fuzhao Xue
Xiaoshi Zhong
855
322
0
10 May 2021
A survey on VQA_Datasets and Approaches
Yeyun Zou
Qiyu Xie
277
21
0
02 May 2021
Chop Chop BERT: Visual Question Answering by Chopping VisualBERT's Heads
International Joint Conference on Artificial Intelligence (IJCAI), 2021
Chenyu Gao
Qi Zhu
Peng Wang
Qi Wu
109
2
0
30 Apr 2021
Comparing Visual Reasoning in Humans and AI
Shravan Murlidaran
Wenjie Wang
Miguel P. Eckstein
197
1
0
29 Apr 2021
A First Look: Towards Explainable TextVQA Models via Visual and Textual Explanations
Varun Nagaraj Rao
Xingjian Zhen
K. Hovsepian
Mingwei Shen
188
21
0
29 Apr 2021
Multimodal Contrastive Training for Visual Representation Learning
Computer Vision and Pattern Recognition (CVPR), 2021
Xin Yuan
Zhe Lin
Jason Kuen
Jianming Zhang
Yilin Wang
Michael Maire
Ajinkya Kale
Baldo Faieta
SSL
252
191
0
26 Apr 2021
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
IEEE International Conference on Computer Vision (ICCV), 2021
Aishwarya Kamath
Mannat Singh
Yann LeCun
Gabriel Synnaeve
Ishan Misra
Nicolas Carion
ObjD, VLM
644
1,058
0
26 Apr 2021
SemEval-2021 Task 6: Detection of Persuasion Techniques in Texts and Images
International Workshop on Semantic Evaluation (SemEval), 2021
Dimitar Dimitrov
Bishr Bin Ali
Shaden Shaar
Firoj Alam
Fabrizio Silvestri
Hamed Firooz
Preslav Nakov
Giovanni Da San Martino
150
120
0
25 Apr 2021
MusCaps: Generating Captions for Music Audio
IEEE International Joint Conference on Neural Network (IJCNN), 2021
Ilaria Manco
Emmanouil Benetos
Elio Quinton
Gyorgy Fazekas
290
43
0
24 Apr 2021
M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2021
Tianrui Guan
Jun Wang
Shiyi Lan
Rohan Chandra
Zuxuan Wu
Larry S. Davis
Tianyi Zhou
ViT, 3DPC
221
157
0
24 Apr 2021
Playing Lottery Tickets with Vision and Language
AAAI Conference on Artificial Intelligence (AAAI), 2021
Zhe Gan
Yen-Chun Chen
Linjie Li
Tianlong Chen
Yu Cheng
Shuohang Wang
Jingjing Liu
Lijuan Wang
Zicheng Liu
VLM
312
62
0
23 Apr 2021
Multiscale Vision Transformers
IEEE International Conference on Computer Vision (ICCV), 2021
Haoqi Fan
Bo Xiong
K. Mangalam
Yanghao Li
Zhicheng Yan
Jitendra Malik
Christoph Feichtenhofer
ViT
482
1,521
0
22 Apr 2021
Comprehensive Multi-Modal Interactions for Referring Image Segmentation
Findings (Findings), 2021
Kanishk Jain
Vineet Gandhi
237
19
0
21 Apr 2021
Understanding Synonymous Referring Expressions via Contrastive Features
International Journal of Computer Vision (IJCV), 2021
Yi-Wen Chen
Yi-Hsuan Tsai
Ming-Hsuan Yang
ObjD
185
5
0
20 Apr 2021
Detector-Free Weakly Supervised Grounding by Separation
IEEE International Conference on Computer Vision (ICCV), 2021
Assaf Arbelle
Sivan Doveh
Amit Alfassy
J. Shtok
Guy Lev
...
Kate Saenko
S. Ullman
Raja Giryes
Rogerio Feris
Leonid Karlinsky
195
31
0
20 Apr 2021
Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training
ACM Multimedia (ACM MM), 2021
Chenyi Lei
Shixian Luo
Yong Liu
Wanggui He
Jiamang Wang
Guoxin Wang
Haihong Tang
Chunyan Miao
Houqiang Li
163
47
0
19 Apr 2021
BM-NAS: Bilevel Multimodal Neural Architecture Search
AAAI Conference on Artificial Intelligence (AAAI), 2021
Yihang Yin
Siyu Huang
Xiang Zhang
232
34
0
19 Apr 2021
LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
Yiheng Xu
Tengchao Lv
Lei Cui
Guoxin Wang
Yijuan Lu
D. Florêncio
Cha Zhang
Furu Wei
MLLM, VLM
270
167
0
18 Apr 2021
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Jack Hessel
Ari Holtzman
Maxwell Forbes
Ronan Le Bras
Yejin Choi
CLIP
969
2,298
0
18 Apr 2021
Cetacean Translation Initiative: a roadmap to deciphering the communication of sperm whales
Jacob Andreas
Gašper Beguš
M. Bronstein
R. Diamant
Denley Delaney
...
D. Tchernov
P. Tønnesen
Antonio Torralba
Daniel M. Vogt
Robert J. Wood
181
13
0
17 Apr 2021
TransVG: End-to-End Visual Grounding with Transformers
IEEE International Conference on Computer Vision (ICCV), 2021
Jiajun Deng
Zhengyuan Yang
Tianlang Chen
Wen-gang Zhou
Houqiang Li
ViT
621
442
0
17 Apr 2021
LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding
Te-Lin Wu
Cheng-rong Li
Mingyang Zhang
Tao Chen
Spurthi Amba Hombaiah
Michael Bendersky
150
15
0
16 Apr 2021