1908.06066
Cited By
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2019
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL
VLM
MLLM
Papers citing "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"
50 / 518 papers shown
Title
Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training
ACM Multimedia (ACM MM), 2021
Chenyi Lei
Shixian Luo
Yong Liu
Wanggui He
Jiamang Wang
Guoxin Wang
Haihong Tang
Chunyan Miao
Houqiang Li
147
47
0
19 Apr 2021
Cross-Modal Retrieval Augmentation for Multi-Modal Classification
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Shir Gur
Natalia Neverova
C. Stauffer
Ser-Nam Lim
Douwe Kiela
A. Reiter
209
36
0
16 Apr 2021
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
Computer Vision and Pattern Recognition (CVPR), 2021
Zhicheng Huang
Zhaoyang Zeng
Yupan Huang
Bei Liu
Dongmei Fu
Jianlong Fu
VLM
ViT
396
303
0
07 Apr 2021
Compressing Visual-linguistic Model via Knowledge Distillation
IEEE International Conference on Computer Vision (ICCV), 2021
Zhiyuan Fang
Jianfeng Wang
Xiaowei Hu
Lijuan Wang
Yezhou Yang
Zicheng Liu
VLM
263
115
0
05 Apr 2021
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
Computer Vision and Pattern Recognition (CVPR), 2021
Mingyang Zhou
Luowei Zhou
Shuohang Wang
Yu Cheng
Linjie Li
Zhou Yu
Jingjing Liu
MLLM
VLM
221
104
0
01 Apr 2021
A Survey on Natural Language Video Localization
Xinfang Liu
Xiushan Nie
Zhifang Tan
Jie Guo
Yilong Yin
229
9
0
01 Apr 2021
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
IEEE International Conference on Computer Vision (ICCV), 2021
Or Patashnik
Zongze Wu
Eli Shechtman
Daniel Cohen-Or
Dani Lischinski
CLIP
VLM
368
1,361
0
31 Mar 2021
Diagnosing Vision-and-Language Navigation: What Really Matters
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Wanrong Zhu
Yuankai Qi
P. Narayana
Kazoo Sone
Sugato Basu
Xinze Wang
Qi Wu
Miguel P. Eckstein
Wenjie Wang
LM&Ro
213
55
0
30 Mar 2021
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
Computer Vision and Pattern Recognition (CVPR), 2021
Antoine Miech
Jean-Baptiste Alayrac
Ivan Laptev
Josef Sivic
Andrew Zisserman
ViT
326
158
0
30 Mar 2021
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
Computer Vision and Pattern Recognition (CVPR), 2021
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Linbo Jin
Ben Chen
Hao Zhou
Minghui Qiu
Ling Shao
VLM
314
133
0
30 Mar 2021
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
IEEE International Conference on Computer Vision (ICCV), 2021
Pengchuan Zhang
Xiyang Dai
Jianwei Yang
Bin Xiao
Lu Yuan
Lei Zhang
Jianfeng Gao
ViT
266
366
0
29 Mar 2021
HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
IEEE International Conference on Computer Vision (ICCV), 2021
Song Liu
Haoqi Fan
Shengsheng Qian
Yiru Chen
Wenkui Ding
Zhongyuan Wang
308
163
0
28 Mar 2021
Multi-Modal Answer Validation for Knowledge-Based VQA
AAAI Conference on Artificial Intelligence (AAAI), 2021
Jialin Wu
Jiasen Lu
Ashish Sabharwal
Roozbeh Mottaghi
354
163
0
23 Mar 2021
Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval
Transactions of the Association for Computational Linguistics (TACL), 2021
Gregor Geigle
Jonas Pfeiffer
Nils Reimers
Ivan Vulić
Iryna Gurevych
281
61
0
22 Mar 2021
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Siqi Sun
Yen-Chun Chen
Linjie Li
Shuohang Wang
Yuwei Fang
Jingjing Liu
VLM
175
89
0
16 Mar 2021
Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision
International Journal of Computer Vision (IJCV), 2021
Andrew Shin
Masato Ishii
T. Narihira
237
48
0
06 Mar 2021
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2021
Krishna Srinivasan
K. Raman
Jiecao Chen
Michael Bendersky
Marc Najork
VLM
490
386
0
02 Mar 2021
M6: A Chinese Multimodal Pretrainer
Junyang Lin
Rui Men
An Yang
Chan Zhou
Ming Ding
...
Yong Li
Jialin Li
Jingren Zhou
J. Tang
Hongxia Yang
VLM
MoE
316
148
0
01 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
International Conference on Machine Learning (ICML), 2021
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
2.0K
40,760
0
26 Feb 2021
UniT: Multimodal Multitask Learning with a Unified Transformer
IEEE International Conference on Computer Vision (ICCV), 2021
Ronghang Hu
Amanpreet Singh
ViT
294
340
0
22 Feb 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Computer Vision and Pattern Recognition (CVPR), 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
1.1K
1,350
0
17 Feb 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Computer Vision and Pattern Recognition (CVPR), 2021
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Joey Tianyi Zhou
Jingjing Liu
CLIP
412
744
0
11 Feb 2021
Telling the What while Pointing to the Where: Multimodal Queries for Image Retrieval
IEEE International Conference on Computer Vision (ICCV), 2021
Soravit Changpinyo
Jordi Pont-Tuset
V. Ferrari
Radu Soricut
166
28
0
09 Feb 2021
CSS-LM: A Contrastive Framework for Semi-supervised Fine-tuning of Pre-trained Language Models
IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2021
Yusheng Su
Xu Han
Yankai Lin
Zhengyan Zhang
Zhiyuan Liu
Peng Li
Jie Zhou
Maosong Sun
153
12
0
07 Feb 2021
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
International Conference on Machine Learning (ICML), 2021
Wonjae Kim
Bokyung Son
Ildoo Kim
VLM
CLIP
531
2,091
0
05 Feb 2021
RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER
AAAI Conference on Artificial Intelligence (AAAI), 2021
Lin Sun
Jiquan Wang
Kai Zhang
Yindu Su
Fangsheng Weng
150
171
0
05 Feb 2021
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Transactions of the Association for Computational Linguistics (TACL), 2021
Lisa Anne Hendricks
John F. J. Mellor
R. Schneider
Jean-Baptiste Alayrac
Aida Nematzadeh
222
125
0
31 Jan 2021
Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
AAAI Conference on Artificial Intelligence (AAAI), 2021
Yehao Li
Yingwei Pan
Ting Yao
Jingwen Chen
Tao Mei
VLM
152
58
0
27 Jan 2021
VisualMRC: Machine Reading Comprehension on Document Images
AAAI Conference on Artificial Intelligence (AAAI), 2021
Ryota Tanaka
Kyosuke Nishida
Sen Yoshida
266
186
0
27 Jan 2021
Cross-lingual Visual Pre-training for Multimodal Machine Translation
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021
Ozan Caglayan
Menekse Kuyu
Mustafa Sercan Amac
Pranava Madhyastha
Erkut Erdem
Aykut Erdem
Lucia Specia
VLM
155
53
0
25 Jan 2021
Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge
Violetta Shevchenko
Damien Teney
A. Dick
Anton Van Den Hengel
205
31
0
15 Jan 2021
Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search
Chen Gao
Guanyu Cai
Xinyang Jiang
Feng Zheng
Jinchao Zhang
Yifei Gong
Pai Peng
Xiao-Wei Guo
Xing Sun
DiffM
278
117
0
08 Jan 2021
Transformers in Vision: A Survey
ACM Computing Surveys (CSUR), 2021
Salman Khan
Muzammal Naseer
Munawar Hayat
Syed Waqas Zamir
Fahad Shahbaz Khan
M. Shah
ViT
874
3,125
0
04 Jan 2021
VinVL: Revisiting Visual Representations in Vision-Language Models
Pengchuan Zhang
Xiujun Li
Xiaowei Hu
Jianwei Yang
Lei Zhang
Lijuan Wang
Yejin Choi
Jianfeng Gao
ObjD
VLM
473
167
0
02 Jan 2021
VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Xiaopeng Lu
Tiancheng Zhao
Kyusong Lee
249
29
0
01 Jan 2021
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
735
406
0
31 Dec 2020
Accurate Word Representations with Universal Visual Guidance
Zhuosheng Zhang
Haojie Yu
Hai Zhao
Rui Wang
Masao Utiyama
163
0
0
30 Dec 2020
OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts
Yuxian Meng
Shuhe Wang
Qinghong Han
Xiaofei Sun
Leilei Gan
Rui Yan
Jiwei Li
367
31
0
30 Dec 2020
Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks
Letitia Parcalabescu
Albert Gatt
Anette Frank
Iacer Calixto
LRM
291
50
0
22 Dec 2020
KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA
Computer Vision and Pattern Recognition (CVPR), 2020
Kenneth Marino
Xinlei Chen
Devi Parikh
Abhinav Gupta
Marcus Rohrbach
248
224
0
20 Dec 2020
A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
Linjie Li
Zhe Gan
Jingjing Liu
VLM
241
50
0
15 Dec 2020
KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning
Knowledge-Based Systems (KBS), 2020
Dandan Song
S. Ma
Zhanchen Sun
Sicheng Yang
L. Liao
SSL
LRM
232
42
0
13 Dec 2020
MiniVLM: A Smaller and Faster Vision-Language Model
Jianfeng Wang
Xiaowei Hu
Pengchuan Zhang
Xiujun Li
Lijuan Wang
Guang Dai
Jianfeng Gao
Zicheng Liu
VLM
MLLM
214
70
0
13 Dec 2020
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
Zhengyuan Yang
Yijuan Lu
Jianfeng Wang
Xi Yin
D. Florêncio
Lijuan Wang
Cha Zhang
Lei Zhang
Jiebo Luo
VLM
234
158
0
08 Dec 2020
Parameter Efficient Multimodal Transformers for Video Representation Learning
Sangho Lee
Youngjae Yu
Gunhee Kim
Thomas Breuel
Jan Kautz
Yale Song
ViT
226
87
0
08 Dec 2020
Classification of Multimodal Hate Speech -- The Winning Solution of Hateful Memes Challenge
Xiayu Zhong
140
16
0
02 Dec 2020
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs
Transactions of the Association for Computational Linguistics (TACL), 2020
Emanuele Bugliarello
Robert Bamler
Naoaki Okazaki
Desmond Elliott
228
125
0
30 Nov 2020
A Recurrent Vision-and-Language BERT for Navigation
Computer Vision and Pattern Recognition (CVPR), 2020
Yicong Hong
Qi Wu
Yuankai Qi
Cristian Rodriguez-Opazo
Stephen Gould
LM&Ro
310
378
0
26 Nov 2020
Multimodal Learning for Hateful Memes Detection
Yi Zhou
Zhenhao Chen
293
71
0
25 Nov 2020
EasyTransfer -- A Simple and Scalable Deep Transfer Learning Platform for NLP Applications
International Conference on Information and Knowledge Management (CIKM), 2020
Minghui Qiu
Peng Li
Chengyu Wang
Hanjie Pan
Yaliang Li
...
Jun Yang
Yaliang Li
Yanjie Liang
Deng Cai
Jialin Li
VLM
SyDa
330
20
0
18 Nov 2020