Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.02265
Cited By
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"
50 / 2,088 papers shown
Title
Learning to Generate Scene Graph from Natural Language Supervision
Yiwu Zhong
Jing Shi
Jianwei Yang
Chenliang Xu
Yin Li
SSL
31
77
0
06 Sep 2021
Data Efficient Masked Language Modeling for Vision and Language
Yonatan Bitton
Gabriel Stanovsky
Michael Elhadad
Roy Schwartz
VLM
11
20
0
05 Sep 2021
LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
Mohammad Abuzar Shaikh
Zhanghexuan Ji
Dana Moukheiber
Yan Shen
S. Srihari
Mingchen Gao
VLM
17
1
0
04 Sep 2021
Weakly Supervised Relative Spatial Reasoning for Visual Question Answering
Pratyay Banerjee
Tejas Gokhale
Yezhou Yang
Chitta Baral
LRM
30
18
0
04 Sep 2021
Supervised Contrastive Learning for Multimodal Unreliable News Detection in COVID-19 Pandemic
Wenjia Zhang
Lin Gui
Yulan He
25
32
0
04 Sep 2021
Multimodal Conditionality for Natural Language Generation
Michael Sollami
Aashish Jain
21
10
0
02 Sep 2021
Point-of-Interest Type Prediction using Text and Images
Danae Sánchez Villegas
Nikolaos Aletras
6
14
0
01 Sep 2021
WebQA: Multihop and Multimodal QA
Yingshan Chang
M. Narang
Hisami Suzuki
Guihong Cao
Jianfeng Gao
Yonatan Bisk
LRM
10
77
0
01 Sep 2021
CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations
Hang Li
Yunxing Kang
Tianqiao Liu
Wenbiao Ding
Zitao Liu
28
17
0
01 Sep 2021
On the Significance of Question Encoder Sequence Model in the Out-of-Distribution Performance in Visual Question Answering
K. Gouthaman
Anurag Mittal
CML
42
0
0
28 Aug 2021
Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training
Yuqing Song
Shizhe Chen
Qin Jin
Wei Luo
Jun Xie
Fei Huang
16
18
0
25 Aug 2021
INVIGORATE: Interactive Visual Grounding and Grasping in Clutter
Hanbo Zhang
Yunfan Lu
Cunjun Yu
David Hsu
Xuguang Lan
Nanning Zheng
LM&Ro
21
63
0
25 Aug 2021
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
Zirui Wang
Jiahui Yu
Adams Wei Yu
Zihang Dai
Yulia Tsvetkov
Yuan Cao
VLM
MLLM
51
779
0
24 Aug 2021
TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment
Jianwei Yang
Yonatan Bisk
Jianfeng Gao
19
137
0
23 Aug 2021
From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network
Yuxin Wang
Hongtao Xie
Shancheng Fang
Jing Wang
Shenggao Zhu
Yongdong Zhang
VLM
49
152
0
22 Aug 2021
Multimodal Breast Lesion Classification Using Cross-Attention Deep Networks
Hung Q. Vo
Pengyu Yuan
T. He
Stephen T. C. Wong
H. Nguyen
13
1
0
21 Aug 2021
Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training
Ming Yan
Haiyang Xu
Chenliang Li
Bin Bi
Junfeng Tian
Min Gui
Wei Wang
VLM
33
10
0
21 Aug 2021
Airbert: In-domain Pretraining for Vision-and-Language Navigation
Pierre-Louis Guhur
Makarand Tapaswi
Shizhe Chen
Ivan Laptev
Cordelia Schmid
LM&Ro
19
135
0
20 Aug 2021
Knowledge Perceived Multi-modal Pretraining in E-commerce
Yushan Zhu
Huaixiao Tou
Wen Zhang
Ganqiang Ye
Hui Chen
Ningyu Zhang
Huajun Chen
20
32
0
20 Aug 2021
Detection of Illicit Drug Trafficking Events on Instagram: A Deep Multimodal Multilabel Learning Approach
Chuanbo Hu
Minglei Yin
Bing Liu
Xin Li
Yanfang Ye
16
15
0
19 Aug 2021
X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics
Yehao Li
Yingwei Pan
Jingwen Chen
Ting Yao
Tao Mei
VLM
19
31
0
18 Aug 2021
Who's Waldo? Linking People Across Text and Images
Claire Yuqing Cui
Apoorv Khandelwal
Yoav Artzi
Noah Snavely
Hadar Averbuch-Elor
23
21
0
16 Aug 2021
MMChat: Multi-Modal Chat Dataset on Social Media
Yinhe Zheng
Guanyi Chen
Xin Liu
K. Lin
9
33
0
16 Aug 2021
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Yuhao Cui
Zhou Yu
Chunqi Wang
Zhongzhou Zhao
Ji Zhang
Meng Wang
Jun-chen Yu
VLM
19
53
0
16 Aug 2021
Video Transformer for Deepfake Detection with Incremental Learning
Sohail Ahmed Khan
Hang Dai
ViT
16
62
0
11 Aug 2021
BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis
Masoud Monajatipoor
Mozhdeh Rouhsedaghat
Liunian Harold Li
Aichi Chien
C.-C. Jay Kuo
Fabien Scalzo
Kai-Wei Chang
LM&MA
MedIm
22
30
0
10 Aug 2021
Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion
Alessandro Suglia
Qiaozi Gao
Jesse Thomason
Govind Thattai
Gaurav Sukhatme
LM&Ro
29
76
0
10 Aug 2021
Relation-aware Compositional Zero-shot Learning for Attribute-Object Pair Recognition
Ziwei Xu
Guangzhi Wang
Yongkang Wong
Mohan S. Kankanhalli
46
26
0
10 Aug 2021
Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models
Zheyuan Liu
Cristian Rodriguez-Opazo
Damien Teney
Stephen Gould
VLM
19
192
0
09 Aug 2021
Disentangling Hate in Online Memes
Rui Cao
Ziqing Fan
Roy Ka-Wei Lee
Wen-Haw Chong
Jing Jiang
24
76
0
09 Aug 2021
Detecting Propaganda Techniques in Memes
Dimitar Dimitrov
Bishr Bin Ali
Shaden Shaar
Firoj Alam
Fabrizio Silvestri
Hamed Firooz
Preslav Nakov
Giovanni Da San Martino
40
93
0
07 Aug 2021
Interpretable Visual Understanding with Cognitive Attention Network
Xuejiao Tang
Wenbin Zhang
Yi Yu
Kea Turner
Tyler Derr
Mengyu Wang
Eirini Ntoutsi
44
12
0
06 Aug 2021
StrucTexT: Structured Text Understanding with Multi-Modal Transformers
Yulin Li
Yuxi Qian
Yuchen Yu
Xiameng Qin
Chengquan Zhang
Yan Liu
Kun Yao
Junyu Han
Jingtuo Liu
Errui Ding
27
113
0
06 Aug 2021
Fast Convergence of DETR with Spatially Modulated Co-Attention
Peng Gao
Minghang Zheng
Xiaogang Wang
Jifeng Dai
Hongsheng Li
ViT
14
305
0
05 Aug 2021
Exploiting BERT For Multimodal Target Sentiment Classification Through Input Space Translation
Zaid Khan
Y. Fu
25
131
0
03 Aug 2021
Representation learning for neural population activity with Neural Data Transformers
Joel Ye
C. Pandarinath
AI4TS
AI4CE
11
51
0
02 Aug 2021
StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators
Rinon Gal
Or Patashnik
Haggai Maron
Gal Chechik
Daniel Cohen-Or
CLIP
VLM
28
220
0
02 Aug 2021
Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding
Heng Zhao
Joey Tianyi Zhou
Yew-Soon Ong
ObjD
19
23
0
31 Jul 2021
Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining
Xunlin Zhan
Yangxin Wu
Xiao Dong
Yunchao Wei
Minlong Lu
Yichi Zhang
Hang Xu
Xiaodan Liang
ViT
21
64
0
30 Jul 2021
Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions
Anil Rahate
Rahee Walambe
S. Ramanna
K. Kotecha
21
135
0
29 Jul 2021
Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing
Pengfei Liu
Weizhe Yuan
Jinlan Fu
Zhengbao Jiang
Hiroaki Hayashi
Graham Neubig
VLM
SyDa
31
3,828
0
28 Jul 2021
Exceeding the Limits of Visual-Linguistic Multi-Task Learning
Cameron R. Wolfe
Keld T. Lundgaard
VLM
37
2
0
27 Jul 2021
Language Grounding with 3D Objects
Jesse Thomason
Mohit Shridhar
Yonatan Bisk
Chris Paxton
Luke Zettlemoyer
LM&Ro
12
52
0
26 Jul 2021
Spatial-Temporal Transformer for Dynamic Scene Graph Generation
Yuren Cong
Wentong Liao
H. Ackermann
Bodo Rosenhahn
M. Yang
ViT
11
122
0
26 Jul 2021
Multi-stage Pre-training over Simplified Multimodal Pre-training Models
Tongtong Liu
Fangxiang Feng
Xiaojie Wang
11
14
0
22 Jul 2021
DRDF: Determining the Importance of Different Multimodal Information with Dual-Router Dynamic Framework
Haiwen Hong
Xuan Jin
Yin Zhang
Yunqing Hu
Jingfeng Zhang
Yuan He
Hui Xue
MoE
19
0
0
21 Jul 2021
Neural Variational Learning for Grounded Language Acquisition
Nisha Pillai
Cynthia Matuszek
Francis Ferraro
VLM
SSL
GAN
DRL
17
2
0
20 Jul 2021
Neural Abstructions: Abstractions that Support Construction for Grounded Language Learning
Kaylee Burns
Christopher D. Manning
Li Fei-Fei
19
0
0
20 Jul 2021
Separating Skills and Concepts for Novel Visual Question Answering
Spencer Whitehead
Hui Wu
Heng Ji
Rogerio Feris
Kate Saenko
CoGe
30
34
0
19 Jul 2021
Constructing Multi-Modal Dialogue Dataset by Replacing Text with Semantically Relevant Images
Nyoungwoo Lee
Suwon Shin
Jaegul Choo
Ho‐Jin Choi
S. Myaeng
11
25
0
19 Jul 2021
Previous
1
2
3
...
32
33
34
...
40
41
42
Next