1908.02265
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Neural Information Processing Systems (NeurIPS), 2019
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
Papers citing
"ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"
50 / 2,232 papers shown
Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Po-Yao (Bernie) Huang
Mandela Patrick
Junjie Hu
Graham Neubig
Florian Metze
Alexander G. Hauptmann
MLLM
VLM
323
60
0
16 Mar 2021
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Siqi Sun
Yen-Chun Chen
Linjie Li
Shuohang Wang
Yuwei Fang
Jingjing Liu
VLM
199
89
0
16 Mar 2021
SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels
Chenliang Li
Ming Yan
Haiyang Xu
Fuli Luo
Wei Wang
Bin Bi
Songfang Huang
VLM
151
39
0
14 Mar 2021
A Survey on Multimodal Disinformation Detection
International Conference on Computational Linguistics (COLING), 2021
Firoj Alam
S. Cresci
Tanmoy Chakraborty
Fabrizio Silvestri
Dimiter Dimitrov
Giovanni Da San Martino
Shaden Shaar
Hamed Firooz
Preslav Nakov
257
116
0
13 Mar 2021
What is Multimodality?
Letitia Parcalabescu
Nils Trost
Anette Frank
230
0
0
10 Mar 2021
Pretrained Transformers as Universal Computation Engines
Kevin Lu
Aditya Grover
Pieter Abbeel
Igor Mordatch
299
230
0
09 Mar 2021
Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision
International Journal of Computer Vision (IJCV), 2021
Andrew Shin
Masato Ishii
T. Narihira
289
50
0
06 Mar 2021
Causal Attention for Vision-Language Tasks
Computer Vision and Pattern Recognition (CVPR), 2021
Xu Yang
Hanwang Zhang
Guojun Qi
Jianfei Cai
CML
228
193
0
05 Mar 2021
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2021
Krishna Srinivasan
K. Raman
Jiecao Chen
Michael Bendersky
Marc Najork
VLM
529
390
0
02 Mar 2021
M6: A Chinese Multimodal Pretrainer
Junyang Lin
Rui Men
An Yang
Chan Zhou
Ming Ding
...
Yong Li
Jialin Li
Jingren Zhou
J. Tang
Hongxia Yang
VLM
MoE
345
147
0
01 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
International Conference on Machine Learning (ICML), 2021
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
2.0K
41,575
0
26 Feb 2021
UniT: Multimodal Multitask Learning with a Unified Transformer
IEEE International Conference on Computer Vision (ICCV), 2021
Ronghang Hu
Amanpreet Singh
ViT
361
343
0
22 Feb 2021
Learning Compositional Representation for Few-shot Visual Question Answering
Dalu Guo
Dacheng Tao
OOD
CoGe
153
4
0
21 Feb 2021
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning
Computer Vision and Pattern Recognition (CVPR), 2021
Jun Chen
Han Guo
Kai Yi
Boyang Albert Li
Mohamed Elhoseiny
VLM
454
274
0
20 Feb 2021
Hierarchical Similarity Learning for Language-based Product Image Retrieval
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021
Zhe Ma
Fenghao Liu
Jianfeng Dong
Xiaoye Qu
Yuan He
S. Ji
VLM
154
7
0
18 Feb 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Computer Vision and Pattern Recognition (CVPR), 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
1.1K
1,360
0
17 Feb 2021
LambdaNetworks: Modeling Long-Range Interactions Without Attention
International Conference on Learning Representations (ICLR), 2021
Irwan Bello
509
187
0
17 Feb 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Computer Vision and Pattern Recognition (CVPR), 2021
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Mohit Bansal
Jingjing Liu
CLIP
458
748
0
11 Feb 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
International Conference on Machine Learning (ICML), 2021
Chao Jia
Yinfei Yang
Ye Xia
Yi-Ting Chen
Zarana Parekh
Hieu H. Pham
Quoc V. Le
Yun-hsuan Sung
Zhen Li
Tom Duerig
VLM
CLIP
1.3K
4,893
0
11 Feb 2021
Telling the What while Pointing to the Where: Multimodal Queries for Image Retrieval
IEEE International Conference on Computer Vision (ICCV), 2021
Soravit Changpinyo
Jordi Pont-Tuset
V. Ferrari
Radu Soricut
197
28
0
09 Feb 2021
Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
Linwei Ye
Mrigank Rochan
Zhi Liu
Xiaoqin Zhang
Yang Wang
VOS
EgoV
131
65
0
09 Feb 2021
Iconographic Image Captioning for Artworks
E. Cetinic
157
27
0
07 Feb 2021
CSS-LM: A Contrastive Framework for Semi-supervised Fine-tuning of Pre-trained Language Models
IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2021
Yusheng Su
Xu Han
Yankai Lin
Zhengyan Zhang
Zhiyuan Liu
Peng Li
Jie Zhou
Maosong Sun
176
12
0
07 Feb 2021
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
International Conference on Machine Learning (ICML), 2021
Wonjae Kim
Bokyung Son
Ildoo Kim
VLM
CLIP
547
2,107
0
05 Feb 2021
RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER
AAAI Conference on Artificial Intelligence (AAAI), 2021
Lin Sun
Jiquan Wang
Kai Zhang
Yindu Su
Fangsheng Weng
159
172
0
05 Feb 2021
Unifying Vision-and-Language Tasks via Text Generation
International Conference on Machine Learning (ICML), 2021
Jaemin Cho
Jie Lei
Hao Tan
Mohit Bansal
MLLM
599
609
0
04 Feb 2021
Inferring spatial relations from textual descriptions of images
Pattern Recognition (Pattern Recogn.), 2021
A. Elu
Gorka Azkune
Oier López de Lacalle
Ignacio Arganda-Carreras
Aitor Soroa Etxabe
Eneko Agirre
139
2
0
01 Feb 2021
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Transactions of the Association for Computational Linguistics (TACL), 2021
Lisa Anne Hendricks
John F. J. Mellor
R. Schneider
Jean-Baptiste Alayrac
Aida Nematzadeh
238
126
0
31 Jan 2021
An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021
Alessandro Suglia
Yonatan Bisk
Ioannis Konstas
Antonio Vergari
E. Bastianelli
Andrea Vanzo
Oliver Lemon
142
8
0
31 Jan 2021
VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs
Computer Vision and Pattern Recognition (CVPR), 2021
Xudong Lin
Gedas Bertasius
Jue Wang
Shih-Fu Chang
Devi Parikh
Lorenzo Torresani
VGen
249
74
0
28 Jan 2021
Bottleneck Transformers for Visual Recognition
Computer Vision and Pattern Recognition (CVPR), 2021
A. Srinivas
Tsung-Yi Lin
Niki Parmar
Jonathon Shlens
Pieter Abbeel
Ashish Vaswani
SLR
703
1,124
0
27 Jan 2021
Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
AAAI Conference on Artificial Intelligence (AAAI), 2021
Yehao Li
Yingwei Pan
Ting Yao
Jingwen Chen
Tao Mei
VLM
157
58
0
27 Jan 2021
Cross-lingual Visual Pre-training for Multimodal Machine Translation
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021
Ozan Caglayan
Menekse Kuyu
Mustafa Sercan Amac
Pranava Madhyastha
Erkut Erdem
Aykut Erdem
Lucia Specia
VLM
199
53
0
25 Jan 2021
Adversarial Text-to-Image Synthesis: A Review
Neural Networks (NN), 2021
Stanislav Frolov
Tobias Hinz
Federico Raue
Jörn Hees
Andreas Dengel
EGVM
322
201
0
25 Jan 2021
Visual Question Answering based on Local-Scene-Aware Referring Expression Generation
Neural Networks (NN), 2021
Jungjun Kim
Dong-Gyu Lee
Jialin Wu
Hong G Jung
Seong-Whan Lee
ObjD
185
24
0
22 Jan 2021
SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation
Computer Vision and Pattern Recognition (CVPR), 2021
Brendan Duke
Abdalla Ahmed
Christian Wolf
P. Aarabi
Graham W. Taylor
VOS
239
188
0
21 Jan 2021
Learning rich touch representations through cross-modal self-supervision
Conference on Robot Learning (CoRL), 2021
Martina Zambelli
Y. Aytar
Francesco Visin
Yuxiang Zhou
R. Hadsell
SSL
199
18
0
21 Jan 2021
Understanding in Artificial Intelligence
S. Maetschke
D. M. Iraola
Pieter Barnard
Elaheh Shafieibavani
Peter Zhong
Ying Xu
Antonio Jimeno Yepes
ELM
VLM
188
0
0
17 Jan 2021
Latent Variable Models for Visual Question Answering
Zixu Wang
Yishu Miao
Lucia Specia
237
5
0
16 Jan 2021
Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge
Violetta Shevchenko
Damien Teney
A. Dick
Anton Van Den Hengel
213
31
0
15 Jan 2021
Probabilistic Embeddings for Cross-Modal Retrieval
Computer Vision and Pattern Recognition (CVPR), 2021
Sanghyuk Chun
Seong Joon Oh
Rafael Sampaio de Rezende
Yannis Kalantidis
Diane Larlus
UQCV
909
261
0
13 Jan 2021
Trear: Transformer-based RGB-D Egocentric Action Recognition
IEEE Transactions on Cognitive and Developmental Systems (IEEE TCDS), 2021
Xiangyu Li
Yonghong Hou
Pichao Wang
Zhimin Gao
Mingliang Xu
Wanqing Li
ViT
389
99
0
05 Jan 2021
Transformers in Vision: A Survey
ACM Computing Surveys (CSUR), 2021
Salman Khan
Muzammal Naseer
Munawar Hayat
Syed Waqas Zamir
Fahad Shahbaz Khan
M. Shah
ViT
924
3,176
0
04 Jan 2021
VinVL: Revisiting Visual Representations in Vision-Language Models
Pengchuan Zhang
Xiujun Li
Xiaowei Hu
Jianwei Yang
Lei Zhang
Lijuan Wang
Yejin Choi
Jianfeng Gao
ObjD
VLM
513
168
0
02 Jan 2021
KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Yiran Xing
Z. Shi
Zhao Meng
Gerhard Lakemeyer
Yunpu Ma
Roger Wattenhofer
VLM
281
45
0
02 Jan 2021
CDLM: Cross-Document Language Modeling
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Avi Caciularu
Arman Cohan
Iz Beltagy
Matthew E. Peters
Arie Cattan
Ido Dagan
VLM
239
34
0
02 Jan 2021
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
797
406
0
31 Dec 2020
Accurate Word Representations with Universal Visual Guidance
Zhuosheng Zhang
Haojie Yu
Hai Zhao
Rui Wang
Masao Utiyama
182
0
0
30 Dec 2020
OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts
Yuxian Meng
Shuhe Wang
Qinghong Han
Xiaofei Sun
Leilei Gan
Rui Yan
Jiwei Li
371
31
0
30 Dec 2020
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Yang Xu
Yiheng Xu
Tengchao Lv
Lei Cui
Furu Wei
...
D. Florêncio
Cha Zhang
Wanxiang Che
Min Zhang
Lidong Zhou
ViT
MLLM
846
610
0
29 Dec 2020