
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

AAAI Conference on Artificial Intelligence (AAAI), 2019
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL, VLM, MLLM

Papers citing "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"

50 / 518 papers shown
Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training
ACM Multimedia (ACM MM), 2021
Chenyi Lei
Shixian Luo
Yong Liu
Wanggui He
Jiamang Wang
Guoxin Wang
Haihong Tang
Chunyan Miao
Houqiang Li
147
47
0
19 Apr 2021
Cross-Modal Retrieval Augmentation for Multi-Modal Classification
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Shir Gur
Natalia Neverova
C. Stauffer
Ser-Nam Lim
Douwe Kiela
A. Reiter
209
36
0
16 Apr 2021
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
Computer Vision and Pattern Recognition (CVPR), 2021
Zhicheng Huang
Zhaoyang Zeng
Yupan Huang
Bei Liu
Dongmei Fu
Jianlong Fu
VLM, ViT
396
303
0
07 Apr 2021
Compressing Visual-linguistic Model via Knowledge Distillation
IEEE International Conference on Computer Vision (ICCV), 2021
Zhiyuan Fang
Jianfeng Wang
Xiaowei Hu
Lijuan Wang
Yezhou Yang
Zicheng Liu
VLM
263
115
0
05 Apr 2021
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
Computer Vision and Pattern Recognition (CVPR), 2021
Mingyang Zhou
Luowei Zhou
Shuohang Wang
Yu Cheng
Linjie Li
Zhou Yu
Jingjing Liu
MLLM, VLM
221
104
0
01 Apr 2021
A Survey on Natural Language Video Localization
Xinfang Liu
Xiushan Nie
Zhifang Tan
Jie Guo
Yilong Yin
229
9
0
01 Apr 2021
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
IEEE International Conference on Computer Vision (ICCV), 2021
Or Patashnik
Zongze Wu
Eli Shechtman
Daniel Cohen-Or
Dani Lischinski
CLIP, VLM
368
1,361
0
31 Mar 2021
Diagnosing Vision-and-Language Navigation: What Really Matters
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Wanrong Zhu
Yuankai Qi
P. Narayana
Kazoo Sone
Sugato Basu
Xinze Wang
Qi Wu
Miguel P. Eckstein
Wenjie Wang
LM&Ro
213
55
0
30 Mar 2021
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
Computer Vision and Pattern Recognition (CVPR), 2021
Antoine Miech
Jean-Baptiste Alayrac
Ivan Laptev
Josef Sivic
Andrew Zisserman
ViT
326
158
0
30 Mar 2021
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
Computer Vision and Pattern Recognition (CVPR), 2021
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Linbo Jin
Ben Chen
Hao Zhou
Minghui Qiu
Ling Shao
VLM
314
133
0
30 Mar 2021
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
IEEE International Conference on Computer Vision (ICCV), 2021
Pengchuan Zhang
Xiyang Dai
Jianwei Yang
Bin Xiao
Lu Yuan
Lei Zhang
Jianfeng Gao
ViT
266
366
0
29 Mar 2021
HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
IEEE International Conference on Computer Vision (ICCV), 2021
Song Liu
Haoqi Fan
Shengsheng Qian
Yiru Chen
Wenkui Ding
Zhongyuan Wang
308
163
0
28 Mar 2021
Multi-Modal Answer Validation for Knowledge-Based VQA
AAAI Conference on Artificial Intelligence (AAAI), 2021
Jialin Wu
Jiasen Lu
Ashish Sabharwal
Roozbeh Mottaghi
354
163
0
23 Mar 2021
Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval
Transactions of the Association for Computational Linguistics (TACL), 2021
Gregor Geigle
Jonas Pfeiffer
Nils Reimers
Ivan Vulić
Iryna Gurevych
281
61
0
22 Mar 2021
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Siqi Sun
Yen-Chun Chen
Linjie Li
Shuohang Wang
Yuwei Fang
Jingjing Liu
VLM
175
89
0
16 Mar 2021
Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision
International Journal of Computer Vision (IJCV), 2021
Andrew Shin
Masato Ishii
T. Narihira
237
48
0
06 Mar 2021
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2021
Krishna Srinivasan
K. Raman
Jiecao Chen
Michael Bendersky
Marc Najork
VLM
490
386
0
02 Mar 2021
M6: A Chinese Multimodal Pretrainer
Junyang Lin
Rui Men
An Yang
Chan Zhou
Ming Ding
...
Yong Li
Jialin Li
Jingren Zhou
J. Tang
Hongxia Yang
VLM, MoE
316
148
0
01 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
International Conference on Machine Learning (ICML), 2021
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP, VLM
2.0K
40,760
0
26 Feb 2021
UniT: Multimodal Multitask Learning with a Unified Transformer
IEEE International Conference on Computer Vision (ICCV), 2021
Ronghang Hu
Amanpreet Singh
ViT
294
340
0
22 Feb 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Computer Vision and Pattern Recognition (CVPR), 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
1.1K
1,350
0
17 Feb 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Computer Vision and Pattern Recognition (CVPR), 2021
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Joey Tianyi Zhou
Jingjing Liu
CLIP
412
744
0
11 Feb 2021
Telling the What while Pointing to the Where: Multimodal Queries for Image Retrieval
IEEE International Conference on Computer Vision (ICCV), 2021
Soravit Changpinyo
Jordi Pont-Tuset
V. Ferrari
Radu Soricut
166
28
0
09 Feb 2021
CSS-LM: A Contrastive Framework for Semi-supervised Fine-tuning of Pre-trained Language Models
IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2021
Yusheng Su
Xu Han
Yankai Lin
Zhengyan Zhang
Zhiyuan Liu
Peng Li
Jie Zhou
Maosong Sun
153
12
0
07 Feb 2021
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
International Conference on Machine Learning (ICML), 2021
Wonjae Kim
Bokyung Son
Ildoo Kim
VLM, CLIP
531
2,091
0
05 Feb 2021
RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER
AAAI Conference on Artificial Intelligence (AAAI), 2021
Lin Sun
Jiquan Wang
Kai Zhang
Yindu Su
Fangsheng Weng
150
171
0
05 Feb 2021
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Transactions of the Association for Computational Linguistics (TACL), 2021
Lisa Anne Hendricks
John F. J. Mellor
R. Schneider
Jean-Baptiste Alayrac
Aida Nematzadeh
222
125
0
31 Jan 2021
Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network
AAAI Conference on Artificial Intelligence (AAAI), 2021
Yehao Li
Yingwei Pan
Ting Yao
Jingwen Chen
Tao Mei
VLM
152
58
0
27 Jan 2021
VisualMRC: Machine Reading Comprehension on Document Images
AAAI Conference on Artificial Intelligence (AAAI), 2021
Ryota Tanaka
Kyosuke Nishida
Sen Yoshida
266
186
0
27 Jan 2021
Cross-lingual Visual Pre-training for Multimodal Machine Translation
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021
Ozan Caglayan
Menekse Kuyu
Mustafa Sercan Amac
Pranava Madhyastha
Erkut Erdem
Aykut Erdem
Lucia Specia
VLM
155
53
0
25 Jan 2021
Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge
Violetta Shevchenko
Damien Teney
A. Dick
Anton Van Den Hengel
205
31
0
15 Jan 2021
Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search
Chen Gao
Guanyu Cai
Xinyang Jiang
Feng Zheng
Jinchao Zhang
Yifei Gong
Pai Peng
Xiao-Wei Guo
Xing Sun
DiffM
278
117
0
08 Jan 2021
Transformers in Vision: A Survey
ACM Computing Surveys (CSUR), 2021
Salman Khan
Muzammal Naseer
Munawar Hayat
Syed Waqas Zamir
Fahad Shahbaz Khan
M. Shah
ViT
874
3,125
0
04 Jan 2021
VinVL: Revisiting Visual Representations in Vision-Language Models
Pengchuan Zhang
Xiujun Li
Xiaowei Hu
Jianwei Yang
Lei Zhang
Lijuan Wang
Yejin Choi
Jianfeng Gao
ObjD, VLM
473
167
0
02 Jan 2021
VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Xiaopeng Lu
Tiancheng Zhao
Kyusong Lee
249
29
0
01 Jan 2021
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
735
406
0
31 Dec 2020
Accurate Word Representations with Universal Visual Guidance
Zhuosheng Zhang
Haojie Yu
Hai Zhao
Rui Wang
Masao Utiyama
163
0
0
30 Dec 2020
OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts
Yuxian Meng
Shuhe Wang
Qinghong Han
Xiaofei Sun
Leilei Gan
Rui Yan
Jiwei Li
367
31
0
30 Dec 2020
Seeing past words: Testing the cross-modal capabilities of pretrained V&L models on counting tasks
Letitia Parcalabescu
Albert Gatt
Anette Frank
Iacer Calixto
LRM
291
50
0
22 Dec 2020
KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA
Computer Vision and Pattern Recognition (CVPR), 2020
Kenneth Marino
Xinlei Chen
Devi Parikh
Abhinav Gupta
Marcus Rohrbach
248
224
0
20 Dec 2020
A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
Linjie Li
Zhe Gan
Jingjing Liu
VLM
241
50
0
15 Dec 2020
KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning
Knowledge-Based Systems (KBS), 2020
Dandan Song
S. Ma
Zhanchen Sun
Sicheng Yang
L. Liao
SSL, LRM
232
42
0
13 Dec 2020
MiniVLM: A Smaller and Faster Vision-Language Model
Jianfeng Wang
Xiaowei Hu
Pengchuan Zhang
Xiujun Li
Lijuan Wang
Guang Dai
Jianfeng Gao
Zicheng Liu
VLM, MLLM
214
70
0
13 Dec 2020
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
Zhengyuan Yang
Yijuan Lu
Jianfeng Wang
Xi Yin
D. Florêncio
Lijuan Wang
Cha Zhang
Lei Zhang
Jiebo Luo
VLM
234
158
0
08 Dec 2020
Parameter Efficient Multimodal Transformers for Video Representation Learning
Sangho Lee
Youngjae Yu
Gunhee Kim
Thomas Breuel
Jan Kautz
Yale Song
ViT
226
87
0
08 Dec 2020
Classification of Multimodal Hate Speech -- The Winning Solution of Hateful Memes Challenge
Xiayu Zhong
140
16
0
02 Dec 2020
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs
Transactions of the Association for Computational Linguistics (TACL), 2020
Emanuele Bugliarello
Robert Bamler
Naoaki Okazaki
Desmond Elliott
228
125
0
30 Nov 2020
A Recurrent Vision-and-Language BERT for Navigation
Computer Vision and Pattern Recognition (CVPR), 2020
Yicong Hong
Qi Wu
Yuankai Qi
Cristian Rodriguez-Opazo
Stephen Gould
LM&Ro
310
378
0
26 Nov 2020
Multimodal Learning for Hateful Memes Detection
Yi Zhou
Zhenhao Chen
293
71
0
25 Nov 2020
EasyTransfer -- A Simple and Scalable Deep Transfer Learning Platform for NLP Applications
International Conference on Information and Knowledge Management (CIKM), 2020
Minghui Qiu
Peng Li
Chengyu Wang
Hanjie Pan
Yaliang Li
...
Jun Yang
Yaliang Li
Yanjie Liang
Deng Cai
Jialin Li
VLM, SyDa
330
20
0
18 Nov 2020