ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1908.02265
  4. Cited By
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for
  Vision-and-Language Tasks

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
    SSL
    VLM
ArXivPDFHTML

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,088 papers shown
Title
CDLM: Cross-Document Language Modeling
CDLM: Cross-Document Language Modeling
Avi Caciularu
Arman Cohan
Iz Beltagy
Matthew E. Peters
Arie Cattan
Ido Dagan
VLM
19
32
0
02 Jan 2021
UNIMO: Towards Unified-Modal Understanding and Generation via
  Cross-Modal Contrastive Learning
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua-Hong Wu
Haifeng Wang
20
375
0
31 Dec 2020
Accurate Word Representations with Universal Visual Guidance
Accurate Word Representations with Universal Visual Guidance
Zhuosheng Zhang
Haojie Yu
Hai Zhao
Rui-cang Wang
Masao Utiyama
17
0
0
30 Dec 2020
OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual
  Contexts
OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts
Yuxian Meng
Shuhe Wang
Qinghong Han
Xiaofei Sun
Fei Wu
Rui Yan
Jiwei Li
16
28
0
30 Dec 2020
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document
  Understanding
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
Yang Xu
Yiheng Xu
Tengchao Lv
Lei Cui
Furu Wei
...
D. Florêncio
Cha Zhang
Wanxiang Che
Min Zhang
Lidong Zhou
ViT
MLLM
145
498
0
29 Dec 2020
Detecting Hate Speech in Memes Using Multimodal Deep Learning
  Approaches: Prize-winning solution to Hateful Memes Challenge
Detecting Hate Speech in Memes Using Multimodal Deep Learning Approaches: Prize-winning solution to Hateful Memes Challenge
Riza Velioglu
J. Rose
VLM
6
85
0
23 Dec 2020
Training data-efficient image transformers & distillation through
  attention
Training data-efficient image transformers & distillation through attention
Hugo Touvron
Matthieu Cord
Matthijs Douze
Francisco Massa
Alexandre Sablayrolles
Hervé Jégou
ViT
16
6,562
0
23 Dec 2020
A Multimodal Framework for the Detection of Hateful Memes
A Multimodal Framework for the Detection of Hateful Memes
Phillip Lippe
Nithin Holla
Shantanu Chandra
S. Rajamanickam
Georgios Antoniou
Ekaterina Shutova
H. Yannakoudakis
6
70
0
23 Dec 2020
Seeing past words: Testing the cross-modal capabilities of pretrained
  V&L models on counting tasks
Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks
Letitia Parcalabescu
Albert Gatt
Anette Frank
Iacer Calixto
LRM
11
48
0
22 Dec 2020
ActionBert: Leveraging User Actions for Semantic Understanding of User
  Interfaces
ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces
Zecheng He
Srinivas Sunkara
Xiaoxue Zang
Ying Xu
Lijuan Liu
Nevan Wichers
Gabriel Schubiner
Ruby B. Lee
Jindong Chen
Blaise Agüera y Arcas
11
78
0
22 Dec 2020
Object-Centric Diagnosis of Visual Reasoning
Object-Centric Diagnosis of Visual Reasoning
Jianwei Yang
Jiayuan Mao
Jiajun Wu
Devi Parikh
David D. Cox
J. Tenenbaum
Chuang Gan
OCL
19
16
0
21 Dec 2020
KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain
  Knowledge-Based VQA
KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA
Kenneth Marino
Xinlei Chen
Devi Parikh
Abhinav Gupta
Marcus Rohrbach
8
179
0
20 Dec 2020
Transformer Interpretability Beyond Attention Visualization
Transformer Interpretability Beyond Attention Visualization
Hila Chefer
Shir Gur
Lior Wolf
20
641
0
17 Dec 2020
MELINDA: A Multimodal Dataset for Biomedical Experiment Method
  Classification
MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification
Te-Lin Wu
Shikhar Singh
S. Paul
Gully A. Burns
Nanyun Peng
22
18
0
16 Dec 2020
ReINTEL: A Multimodal Data Challenge for Responsible Information
  Identification on Social Network Sites
ReINTEL: A Multimodal Data Challenge for Responsible Information Identification on Social Network Sites
Duc-Trong Le
Xuan-Son Vu
Nhu-Dung To
Huu Nguyen
Thuy-Trinh Nguyen
...
A. Nguyen
Minh-Duc Hoang
Nghia T. Le
Huyen Thi Minh Nguyen
Hoang D. Nguyen
39
14
0
16 Dec 2020
A Closer Look at the Robustness of Vision-and-Language Pre-trained
  Models
A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
Linjie Li
Zhe Gan
Jingjing Liu
VLM
25
42
0
15 Dec 2020
Attention over learned object embeddings enables complex visual
  reasoning
Attention over learned object embeddings enables complex visual reasoning
David Ding
Felix Hill
Adam Santoro
Malcolm Reynolds
M. Botvinick
OCL
14
69
0
15 Dec 2020
Vilio: State-of-the-art Visio-Linguistic Models applied to Hateful Memes
Vilio: State-of-the-art Visio-Linguistic Models applied to Hateful Memes
Niklas Muennighoff
6
64
0
14 Dec 2020
KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual
  Commonsense Reasoning
KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning
Dandan Song
S. Ma
Zhanchen Sun
Sicheng Yang
L. Liao
SSL
LRM
17
38
0
13 Dec 2020
MiniVLM: A Smaller and Faster Vision-Language Model
MiniVLM: A Smaller and Faster Vision-Language Model
Jianfeng Wang
Xiaowei Hu
Pengchuan Zhang
Xiujun Li
Lijuan Wang
L. Zhang
Jianfeng Gao
Zicheng Liu
VLM
MLLM
27
58
0
13 Dec 2020
Look Before you Speak: Visually Contextualized Utterances
Look Before you Speak: Visually Contextualized Utterances
Paul Hongsuck Seo
Arsha Nagrani
Cordelia Schmid
16
66
0
10 Dec 2020
Topological Planning with Transformers for Vision-and-Language
  Navigation
Topological Planning with Transformers for Vision-and-Language Navigation
Kevin Chen
Junshen K. Chen
Jo Chuang
Marynel Vázquez
Silvio Savarese
LM&Ro
17
99
0
09 Dec 2020
Hateful Memes Detection via Complementary Visual and Linguistic Networks
Hateful Memes Detection via Complementary Visual and Linguistic Networks
W. Zhang
Guihua Liu
Zhuohua Li
Fuqing Zhu
16
17
0
09 Dec 2020
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
Zhengyuan Yang
Yijuan Lu
Jianfeng Wang
Xi Yin
D. Florêncio
Lijuan Wang
Cha Zhang
Lei Zhang
Jiebo Luo
VLM
26
140
0
08 Dec 2020
Parameter Efficient Multimodal Transformers for Video Representation
  Learning
Parameter Efficient Multimodal Transformers for Video Representation Learning
Sangho Lee
Youngjae Yu
Gunhee Kim
Thomas Breuel
Jan Kautz
Yale Song
ViT
13
76
0
08 Dec 2020
Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation
Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation
Jeff Da
Maxwell Forbes
Rowan Zellers
Anthony Zheng
Jena D. Hwang
Antoine Bosselut
Yejin Choi
DiffM
12
12
0
08 Dec 2020
WeaQA: Weak Supervision via Captions for Visual Question Answering
WeaQA: Weak Supervision via Captions for Visual Question Answering
Pratyay Banerjee
Tejas Gokhale
Yezhou Yang
Chitta Baral
12
34
0
04 Dec 2020
Understanding Guided Image Captioning Performance across Domains
Understanding Guided Image Captioning Performance across Domains
Edwin G. Ng
Bo Pang
P. Sharma
Radu Soricut
19
24
0
04 Dec 2020
Classification of Multimodal Hate Speech -- The Winning Solution of
  Hateful Memes Challenge
Classification of Multimodal Hate Speech -- The Winning Solution of Hateful Memes Challenge
Xiayu Zhong
12
15
0
02 Dec 2020
Open-Ended Multi-Modal Relational Reasoning for Video Question Answering
Open-Ended Multi-Modal Relational Reasoning for Video Question Answering
Haozheng Luo
Ruiyang Qin
Chenwei Xu
Guo Ye
Zening Luo
43
4
0
01 Dec 2020
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework
  of Vision-and-Language BERTs
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs
Emanuele Bugliarello
Ryan Cotterell
Naoaki Okazaki
Desmond Elliott
22
119
0
30 Nov 2020
Point and Ask: Incorporating Pointing into Visual Question Answering
Point and Ask: Incorporating Pointing into Visual Question Answering
Arjun Mani
Nobline Yoo
William Fu-Hinthorn
Olga Russakovsky
3DPC
23
37
0
27 Nov 2020
Learning from Lexical Perturbations for Consistent Visual Question
  Answering
Learning from Lexical Perturbations for Consistent Visual Question Answering
Spencer Whitehead
Hui Wu
Yi Ren Fung
Heng Ji
Rogerio Feris
Kate Saenko
29
11
0
26 Nov 2020
A Recurrent Vision-and-Language BERT for Navigation
A Recurrent Vision-and-Language BERT for Navigation
Yicong Hong
Qi Wu
Yuankai Qi
Cristian Rodriguez-Opazo
Stephen Gould
LM&Ro
27
292
0
26 Nov 2020
Multimodal Learning for Hateful Memes Detection
Multimodal Learning for Hateful Memes Detection
Yi Zhou
Zhenhao Chen
8
56
0
25 Nov 2020
Open-Vocabulary Object Detection Using Captions
Open-Vocabulary Object Detection Using Captions
Alireza Zareian
Kevin Dela Rosa
Derek Hao Hu
Shih-Fu Chang
VLM
ObjD
40
417
0
20 Nov 2020
EasyTransfer -- A Simple and Scalable Deep Transfer Learning Platform
  for NLP Applications
EasyTransfer -- A Simple and Scalable Deep Transfer Learning Platform for NLP Applications
Minghui Qiu
Peng Li
Chengyu Wang
Hanjie Pan
Yaliang Li
...
Jun Yang
Yaliang Li
Jun Huang
Deng Cai
Wei Lin
VLM
SyDa
23
20
0
18 Nov 2020
A Hierarchical Multi-Modal Encoder for Moment Localization in Video
  Corpus
A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus
Bowen Zhang
Hexiang Hu
Joonseok Lee
Mingde Zhao
Sheide Chammas
Vihan Jain
Eugene Ie
Fei Sha
17
30
0
18 Nov 2020
Generating Natural Questions from Images for Multimodal Assistants
Generating Natural Questions from Images for Multimodal Assistants
Alkesh Patel
Akanksha Bindal
Hadas Kotek
Christopher Klein
Jason D. Williams
VGen
31
7
0
17 Nov 2020
Improving Calibration in Deep Metric Learning With Cross-Example Softmax
Improving Calibration in Deep Metric Learning With Cross-Example Softmax
Andreas Veit
Kimberly Wilber
9
2
0
17 Nov 2020
iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video
  Captioning and Video Question Answering
iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering
Aman Chadha
Gurneet Arora
Navpreet Kaloty
11
35
0
16 Nov 2020
ActBERT: Learning Global-Local Video-Text Representations
ActBERT: Learning Global-Local Video-Text Representations
Linchao Zhu
Yi Yang
ViT
29
417
0
14 Nov 2020
Multimodal Pretraining for Dense Video Captioning
Multimodal Pretraining for Dense Video Captioning
Gabriel Huang
Bo Pang
Zhenhai Zhu
Clara E. Rivera
Radu Soricut
8
82
0
10 Nov 2020
Human-centric Spatio-Temporal Video Grounding With Visual Transformers
Human-centric Spatio-Temporal Video Grounding With Visual Transformers
Zongheng Tang
Yue Liao
Si Liu
Guanbin Li
Xiaojie Jin
Hongxu Jiang
Qian Yu
Dong Xu
16
94
0
10 Nov 2020
Long Range Arena: A Benchmark for Efficient Transformers
Long Range Arena: A Benchmark for Efficient Transformers
Yi Tay
Mostafa Dehghani
Samira Abnar
Yikang Shen
Dara Bahri
Philip Pham
J. Rao
Liu Yang
Sebastian Ruder
Donald Metzler
36
691
0
08 Nov 2020
Training Transformers for Information Security Tasks: A Case Study on
  Malicious URL Prediction
Training Transformers for Information Security Tasks: A Case Study on Malicious URL Prediction
Ethan M. Rudd
Ahmed Abdallah
17
5
0
05 Nov 2020
Cross-Media Keyphrase Prediction: A Unified Framework with
  Multi-Modality Multi-Head Attention and Image Wordings
Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings
Yue Wang
Jing Li
M. Lyu
Irwin King
11
16
0
03 Nov 2020
COOT: Cooperative Hierarchical Transformer for Video-Text Representation
  Learning
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Simon Ging
Mohammadreza Zolfaghari
Hamed Pirsiavash
Thomas Brox
ViT
CLIP
13
168
0
01 Nov 2020
Leveraging Visual Question Answering to Improve Text-to-Image Synthesis
Leveraging Visual Question Answering to Improve Text-to-Image Synthesis
Stanislav Frolov
Shailza Jolly
Jörn Hees
Andreas Dengel
EGVM
14
5
0
28 Oct 2020
Co-attentional Transformers for Story-Based Video Understanding
Co-attentional Transformers for Story-Based Video Understanding
Björn Bebensee
Byoung-Tak Zhang
14
4
0
27 Oct 2020
Previous
123...373839404142
Next