Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

AAAI Conference on Artificial Intelligence (AAAI), 2019
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL VLM MLLM

Papers citing "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"

50 / 518 papers shown
What Vision-Language Models 'See' when they See Scenes
Michele Cafagna
Kees van Deemter
Albert Gatt
VLM
256
13
0
15 Sep 2021
xGQA: Cross-Lingual Visual Question Answering
Jonas Pfeiffer
Gregor Geigle
Aishwarya Kamath
Jan-Martin O. Steitz
Stefan Roth
Ivan Vulić
Iryna Gurevych
357
78
0
13 Sep 2021
Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval
Zhihao Fan
Zhongyu Wei
Zejun Li
Siyuan Wang
Haijun Shan
Xuanjing Huang
Jianqing Fan
CLIP
96
12
0
12 Sep 2021
Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Tiezheng Yu
Wenliang Dai
Zihan Liu
Pascale Fung
293
79
0
06 Sep 2021
Improving Joint Learning of Chest X-Ray and Radiology Report by Word Region Alignment
Zhanghexuan Ji
Mohammad Abuzar Shaikh
Dana Moukheiber
S. Srihari
Yifan Peng
Mingchen Gao
SSL
182
23
0
04 Sep 2021
LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
Mohammad Abuzar Shaikh
Zhanghexuan Ji
Dana Moukheiber
Yan Shen
S. Srihari
Mingchen Gao
VLM
152
1
0
04 Sep 2021
Multimodal Conditionality for Natural Language Generation
Michael Sollami
Aashish Jain
115
10
0
02 Sep 2021
CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Hang Li
Yunxing Kang
Tianqiao Liu
Wenbiao Ding
Zitao Liu
166
20
0
01 Sep 2021
Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training
ACM Multimedia (ACM MM), 2021
Yuqing Song
Shizhe Chen
Qin Jin
Wei Luo
Jun Xie
Fei Huang
198
25
0
25 Aug 2021
Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training
Ming Yan
Haiyang Xu
Chenliang Li
Bin Bi
Junfeng Tian
Min Gui
Wei Wang
VLM
122
11
0
21 Aug 2021
Knowledge Perceived Multi-modal Pretraining in E-commerce
Yushan Zhu
Huaixiao Tou
Wen Zhang
Ganqiang Ye
Hui Chen
Ningyu Zhang
Huajun Chen
229
37
0
20 Aug 2021
Indoor Semantic Scene Understanding using Multi-modality Fusion
Muraleekrishna Gopinathan
Giang Truong
Jumana Abu-Khalaf
157
0
0
17 Aug 2021
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Yuhao Cui
Zhou Yu
Chunqi Wang
Zhongzhou Zhao
Ji Zhang
Meng Wang
Jun-chen Yu
VLM
166
58
0
16 Aug 2021
StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators
Rinon Gal
Or Patashnik
Haggai Maron
Gal Chechik
Daniel Cohen-Or
CLIP VLM
267
275
0
02 Aug 2021
BadEncoder: Backdoor Attacks to Pre-trained Encoders in Self-Supervised Learning
Jinyuan Jia
Yupei Liu
Neil Zhenqiang Gong
SILM SSL
305
184
0
01 Aug 2021
UIBert: Learning Generic Multimodal Representations for UI Understanding
International Joint Conference on Artificial Intelligence (IJCAI), 2021
Chongyang Bai
Xiaoxue Zang
Ying Xu
Srinivas Sunkara
Abhinav Rastogi
Jindong Chen
Blaise Agüera y Arcas
258
111
0
29 Jul 2021
Exceeding the Limits of Visual-Linguistic Multi-Task Learning
Cameron R. Wolfe
Keld T. Lundgaard
VLM
144
3
0
27 Jul 2021
DRDF: Determining the Importance of Different Multimodal Information with Dual-Router Dynamic Framework
ACM Multimedia (ACM MM), 2021
Haiwen Hong
Xuan Jin
Yin Zhang
Yunqing Hu
Jingfeng Zhang
Yuan He
Hui Xue
MoE
104
0
0
21 Jul 2021
Separating Skills and Concepts for Novel Visual Question Answering
Computer Vision and Pattern Recognition (CVPR), 2021
Spencer Whitehead
Hui Wu
Heng Ji
Rogerio Feris
Kate Saenko
CoGe
179
38
0
19 Jul 2021
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Neural Information Processing Systems (NeurIPS), 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
FaML
826
2,461
0
16 Jul 2021
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer
Zineng Tang
Jaemin Cho
Hao Tan
Joey Tianyi Zhou
VLM
186
33
0
06 Jul 2021
PhotoChat: A Human-Human Dialogue Dataset with Photo Sharing Behavior for Joint Image-Text Modeling
Xiaoxue Zang
Lijuan Liu
Maria Wang
Yang Song
Hao Zhang
Jindong Chen
VLM
239
65
0
06 Jul 2021
Productivity, Portability, Performance: Data-Centric Python
Yiheng Wang
Yao Zhang
Yanzhang Wang
Yan Wan
Jiao Wang
Zhongyuan Wu
Yuhao Yang
Bowen She
402
111
0
01 Jul 2021
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
Jing Liu
Xinxin Zhu
Fei Liu
Longteng Guo
Zijia Zhao
...
Weining Wang
Hanqing Lu
Shiyu Zhou
Jiajun Zhang
Jinqiao Wang
285
41
0
01 Jul 2021
Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training
Hongwei Xue
Yupan Huang
Bei Liu
Houwen Peng
Jianlong Fu
Houqiang Li
Jiebo Luo
403
93
0
25 Jun 2021
A Transformer-based Cross-modal Fusion Model with Adversarial Training for VQA Challenge 2021
Keda Lu
Bo Fang
Kuan-Yu Chen
ViT
92
2
0
24 Jun 2021
Towards Long-Form Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2021
Chaoxia Wu
Philipp Krahenbuhl
VLM ViT
314
193
0
21 Jun 2021
GEM: A General Evaluation Benchmark for Multimodal Tasks
Findings (Findings), 2021
Lin Su
Nan Duan
Edward Cui
Lei Ji
Chenfei Wu
Huaishao Luo
Yongfei Liu
Ming Zhong
Taroon Bharti
Arun Sacheti
VLM
193
22
0
18 Jun 2021
Efficient Self-supervised Vision Transformers for Representation Learning
International Conference on Learning Representations (ICLR), 2021
Chunyuan Li
Jianwei Yang
Pengchuan Zhang
Mei Gao
Bin Xiao
Xiyang Dai
Lu Yuan
Jianfeng Gao
ViT
287
221
0
17 Jun 2021
Probing Image-Language Transformers for Verb Understanding
Lisa Anne Hendricks
Aida Nematzadeh
211
131
0
16 Jun 2021
Pre-Trained Models: Past, Present and Future
AI Open (AO), 2021
Xu Han
Zhengyan Zhang
Ning Ding
Yuxian Gu
Xiao Liu
...
Jie Tang
Ji-Rong Wen
Jinhui Yuan
Wayne Xin Zhao
Jun Zhu
AIFin MQ AI4MH
384
985
0
14 Jun 2021
Assessing Multilingual Fairness in Pre-trained Multimodal Representations
Findings (Findings), 2021
Jialu Wang
Yang Liu
Xinze Wang
EGVM
233
42
0
12 Jun 2021
Team RUC_AIM3 Technical Report at ActivityNet 2021: Entities Object Localization
Ludan Ruan
Jieting Chen
Yuqing Song
Shizhe Chen
Qin Jin
84
0
0
11 Jun 2021
Chasing Sparsity in Vision Transformers: An End-to-End Exploration
Neural Information Processing Systems (NeurIPS), 2021
Tianlong Chen
Yu Cheng
Zhe Gan
Lu Yuan
Lei Zhang
Zinan Lin
ViT
242
255
0
08 Jun 2021
BERTGEN: Multi-task Generation through BERT
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Faidon Mitzalis
Ozan Caglayan
Pranava Madhyastha
Lucia Specia
VLM
108
7
0
07 Jun 2021
MERLOT: Multimodal Neural Script Knowledge Models
Neural Information Processing Systems (NeurIPS), 2021
Rowan Zellers
Ximing Lu
Jack Hessel
Youngjae Yu
J. S. Park
Jize Cao
Ali Farhadi
Yejin Choi
VLM LRM
348
425
0
04 Jun 2021
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Haiyang Xu
Ming Yan
Chenliang Li
Bin Bi
Songfang Huang
Wenming Xiao
Fei Huang
VLM
310
126
0
03 Jun 2021
GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning
Findings (Findings), 2021
Jiaqi Chen
Jianheng Tang
Jinghui Qin
Xiaodan Liang
Lingbo Liu
Eric Xing
Liang Lin
AIMat
218
251
0
30 May 2021
Modeling Text-visual Mutual Dependency for Multi-modal Dialog Generation
Shuhe Wang
Yuxian Meng
Xiaofei Sun
Leilei Gan
Rongbin Ouyang
Rui Yan
Tianwei Zhang
Jiwei Li
220
15
0
30 May 2021
M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis via Non-Autoregressive Generative Transformers
Zhu Zhang
Jianxin Ma
Chang Zhou
Rui Men
Zhikang Li
Ming Ding
Jie Tang
Jingren Zhou
Hongxia Yang
345
47
0
29 May 2021
Multi-Modal Semantic Inconsistency Detection in Social Media News Posts
Conference on Multimedia Modeling (MMM), 2021
S. McCrae
Kehan Wang
A. Zakhor
142
16
0
26 May 2021
Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking
Findings (Findings), 2021
Heng-Da Xu
Zhongli Li
Qingyu Zhou
Chao Li
Zizhen Wang
Yunbo Cao
Heyan Huang
Xian-Ling Mao
195
109
0
26 May 2021
Understanding Mobile GUI: from Pixel-Words to Screen-Sentences
Jingwen Fu
Xiaoyi Zhang
Yuwang Wang
Wenjun Zeng
Sam Yang
Grayson Hilliard
226
16
0
25 May 2021
Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
IEEE Journal of Biomedical and Health Informatics (JBHI), 2021
Jong Hak Moon
HyunGyung Lee
W. Shin
Young-Hak Kim
Edward Choi
MedIm
221
210
0
24 May 2021
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
Findings (Findings), 2021
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Prahal Arora
Masoumeh Aminzadeh
Christoph Feichtenhofer
Florian Metze
Luke Zettlemoyer
327
146
0
20 May 2021
A Review on Explainability in Multimodal Deep Neural Nets
IEEE Access (IEEE Access), 2021
Gargi Joshi
Rahee Walambe
K. Kotecha
373
171
0
17 May 2021
Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval
International Conference on Machine Learning and Applications (ICMLA), 2021
K. Ueki
254
5
0
16 May 2021
Recent Advances in Deep Learning Based Dialogue Systems: A Systematic Survey
Artificial Intelligence Review (AIR), 2021
Jinjie Ni
Tom Young
Vlad Pandelea
Fuzhao Xue
Xiaoshi Zhong
808
320
0
10 May 2021
Playing Lottery Tickets with Vision and Language
AAAI Conference on Artificial Intelligence (AAAI), 2021
Zhe Gan
Yen-Chun Chen
Linjie Li
Tianlong Chen
Yu Cheng
Shuohang Wang
Jingjing Liu
Lijuan Wang
Zicheng Liu
VLM
300
62
0
23 Apr 2021
Detector-Free Weakly Supervised Grounding by Separation
IEEE International Conference on Computer Vision (ICCV), 2021
Assaf Arbelle
Sivan Doveh
Amit Alfassy
J. Shtok
Guy Lev
...
Kate Saenko
S. Ullman
Raja Giryes
Rogerio Feris
Leonid Karlinsky
174
31
0
20 Apr 2021