Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

AAAI Conference on Artificial Intelligence (AAAI), 2019
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL, VLM, MLLM

Papers citing "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"

50 / 518 papers shown
Generating Natural Questions from Images for Multimodal Assistants
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020
Alkesh Patel
Sudarshan Ramanujam
Hadas Kotek
Christopher Klein
Jason D. Williams
VGen
158
9
0
17 Nov 2020
Improving Calibration in Deep Metric Learning With Cross-Example Softmax
Andreas Veit
Kimberly Wilber
55
3
0
17 Nov 2020
ActBERT: Learning Global-Local Video-Text Representations
Computer Vision and Pattern Recognition (CVPR), 2020
Linchao Zhu
Yi Yang
ViT
270
450
0
14 Nov 2020
Multimodal Pretraining for Dense Video Captioning
Gabriel Huang
Bo Pang
Zhenhai Zhu
Clara E. Rivera
Radu Soricut
160
99
0
10 Nov 2020
Human-centric Spatio-Temporal Video Grounding With Visual Transformers
Zongheng Tang
Yue Liao
Si Liu
Guanbin Li
Xiaojie Jin
Hongxu Jiang
Qian Yu
Dong Xu
176
125
0
10 Nov 2020
Co-attentional Transformers for Story-Based Video Understanding
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020
Björn Bebensee
Byoung-Tak Zhang
112
7
0
27 Oct 2020
MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering
Findings (Findings), 2020
Aisha Urooj Khan
Amir Mazaheri
N. Lobo
M. Shah
193
61
0
27 Oct 2020
Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions
Liunian Harold Li
Haoxuan You
Zhecan Wang
Alireza Zareian
Shih-Fu Chang
Kai-Wei Chang
SSL, VLM
183
12
0
24 Oct 2020
ST-BERT: Cross-modal Language Model Pre-training For End-to-end Spoken Language Understanding
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020
Minjeong Kim
Gyuwan Kim
Sang-Woo Lee
Jung-Woo Ha
VLM
170
37
0
23 Oct 2020
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
Shagun Uppal
Sarthak Bhagat
Devamanyu Hazarika
Navonil Majumdar
Soujanya Poria
Roger Zimmermann
Amir Zadeh
265
6
0
19 Oct 2020
Unsupervised Natural Language Inference via Decoupled Multimodal Contrastive Learning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Wanyun Cui
Guangyu Zheng
Wei Wang
SSL
90
21
0
16 Oct 2020
CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations
Fuli Luo
Pengcheng Yang
Shicheng Li
Xuancheng Ren
Xu Sun
VLM, SSL
171
16
0
13 Oct 2020
Contrast and Classify: Training Robust VQA Models
Yash Kant
A. Moudgil
Dhruv Batra
Devi Parikh
Harsh Agrawal
131
5
0
13 Oct 2020
Beyond Language: Learning Commonsense from Images for Reasoning
Findings (Findings), 2020
Wanqing Cui
Yanyan Lan
Liang Pang
Jiafeng Guo
Xueqi Cheng
LRM
129
5
0
10 Oct 2020
Learning to Represent Image and Text with Denotation Graph
Bowen Zhang
Hexiang Hu
Vihan Jain
Eugene Ie
Fei Sha
152
22
0
06 Oct 2020
Support-set bottlenecks for video-text representation learning
Mandela Patrick
Po-Yao (Bernie) Huang
Yuki M. Asano
Florian Metze
Alexander G. Hauptmann
João Henriques
Andrea Vedaldi
262
260
0
06 Oct 2020
Multi-Modal Open-Domain Dialogue
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Kurt Shuster
Eric Michael Smith
Da Ju
Jason Weston
AI4CE
259
48
0
02 Oct 2020
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Jaemin Cho
Jiasen Lu
Dustin Schwenk
Hannaneh Hajishirzi
Aniruddha Kembhavi
VLM, MLLM
162
106
0
23 Sep 2020
MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Tejas Gokhale
Pratyay Banerjee
Chitta Baral
Yezhou Yang
OOD
170
155
0
18 Sep 2020
A Multimodal Memes Classification: A Survey and Open Research Issues
Tariq Habib Afridi
A. Alam
Muhammad Numan Khan
Jawad Khan
Young-Koo Lee
190
42
0
17 Sep 2020
Denoising Large-Scale Image Captioning from Alt-text Data using Content Selection Models
International Conference on Computational Linguistics (COLING), 2020
Khyathi Chandu
Piyush Sharma
Soravit Changpinyo
Ashish V. Thapliyal
Radu Soricut
DiffM, VLM
201
3
0
10 Sep 2020
Active Contrastive Learning of Audio-Visual Video Representations
Shuang Ma
Zhaoyang Zeng
Daniel J. McDuff
Yale Song
VLM, SSL
160
9
0
31 Aug 2020
DeVLBert: Learning Deconfounded Visio-Linguistic Representations
Shengyu Zhang
Tan Jiang
Tan Wang
Kun Kuang
Zhou Zhao
Jianke Zhu
Jin Yu
Hongxia Yang
Leilei Gan
OOD
171
93
0
16 Aug 2020
Weakly supervised cross-domain alignment with optimal transport
Siyang Yuan
Ke Bai
Liqun Chen
Yizhe Zhang
Chenyang Tao
Chunyuan Li
Guoyin Wang
Ricardo Henao
Lawrence Carin
OT
142
7
0
14 Aug 2020
SeqDialN: Sequential Visual Dialog Networks in Joint Visual-Linguistic Representation Space
Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc), 2020
Liu Yang
VLM
147
5
0
02 Aug 2020
Spatially Aware Multimodal Transformers for TextVQA
European Conference on Computer Vision (ECCV), 2020
Yash Kant
Dhruv Batra
Peter Anderson
Alex Schwing
Devi Parikh
Jiasen Lu
Harsh Agrawal
179
93
0
23 Jul 2020
Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation
Wanrong Zhu
Xinze Wang
Tsu-Jui Fu
An Yan
P. Narayana
Kazoo Sone
Sugato Basu
Wenjie Wang
327
38
0
01 Jul 2020
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
Fei Yu
Jiji Tang
Weichong Yin
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
360
399
0
30 Jun 2020
Video-Grounded Dialogues with Pretrained Generation Language Models
Hung Le
Guosheng Lin
138
31
0
27 Jun 2020
Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning"
Saeed Amizadeh
Hamid Palangi
Oleksandr Polozov
Yichen Huang
K. Koishida
NAI, LRM
292
68
0
20 Jun 2020
Contrastive Learning for Weakly Supervised Phrase Grounding
Tanmay Gupta
Arash Vahdat
Gal Chechik
Xiaodong Yang
Jan Kautz
Derek Hoiem
ObjD, SSL
272
157
0
17 Jun 2020
VirTex: Learning Visual Representations from Textual Annotations
Computer Vision and Pattern Recognition (CVPR), 2020
Karan Desai
Justin Johnson
SSL, VLM
424
465
0
11 Jun 2020
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Neural Information Processing Systems (NeurIPS), 2020
Zhe Gan
Yen-Chun Chen
Linjie Li
Chen Zhu
Yu Cheng
Jingjing Liu
ObjD, VLM
338
535
0
11 Jun 2020
M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training
Minheng Ni
Haoyang Huang
Lin Su
Edward Cui
Taroon Bharti
Lijuan Wang
Jianfeng Gao
Dongdong Zhang
Nan Duan
244
7
0
04 Jun 2020
FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval
D. Gao
Linbo Jin
Ben Chen
Minghui Qiu
Peng Li
Yi Wei
Yitao Hu
Haozhe Jasper Wang
OOD
201
146
0
20 May 2020
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
Jize Cao
Zhe Gan
Yu Cheng
Licheng Yu
Yen-Chun Chen
Jingjing Liu
VLM
248
137
0
15 May 2020
Cross-media Structured Common Space for Multimedia Event Extraction
Annual Meeting of the Association for Computational Linguistics (ACL), 2020
Pengfei Yu
Alireza Zareian
Qi Zeng
Spencer Whitehead
Di Lu
Heng Ji
Shih-Fu Chang
151
116
0
05 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM, VLM, OffRL, AI4TS
645
536
0
01 May 2020
Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
European Conference on Computer Vision (ECCV), 2020
Arjun Majumdar
Ayush Shrivastava
Stefan Lee
Peter Anderson
Devi Parikh
Dhruv Batra
LM&Ro
404
256
0
30 Apr 2020
VD-BERT: A Unified Vision and Dialog Transformer with BERT
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Yue Wang
Shafiq Joty
Michael R. Lyu
Irwin King
Caiming Xiong
Guosheng Lin
307
107
0
28 Apr 2020
Are we pretraining it right? Digging deeper into visio-linguistic pretraining
Amanpreet Singh
Vedanuj Goswami
Devi Parikh
VLM
144
48
0
19 Apr 2020
Relation Transformer Network
Rajat Koner
Poulami Sinhamahapatra
Volker Tresp
ViT
291
35
0
13 Apr 2020
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
European Conference on Computer Vision (ECCV), 2020
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
...
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
707
2,123
0
13 Apr 2020
XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Yaobo Liang
Nan Duan
Yeyun Gong
Ning Wu
Fenfei Guo
...
Shuguang Liu
Fan Yang
Daniel Fernando Campos
Rangan Majumder
Ming Zhou
ELM, VLM
292
367
0
03 Apr 2020
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
Zhicheng Huang
Zhaoyang Zeng
Bei Liu
Dongmei Fu
Jianlong Fu
ViT
358
467
0
02 Apr 2020
Pre-trained Models for Natural Language Processing: A Survey
Science China Technological Sciences (Sci China Technol Sci), 2020
Xipeng Qiu
Tianxiang Sun
Yige Xu
Yunfan Shao
Ning Dai
Xuanjing Huang
LM&MA, VLM
941
1,606
0
18 Mar 2020
IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval
Computer Vision and Pattern Recognition (CVPR), 2020
Hui Chen
Guiguang Ding
Xudong Liu
Zijia Lin
Ji Liu
Jungong Han
165
360
0
08 Mar 2020
XGPT: Cross-modal Generative Pre-Training for Image Captioning
Natural Language Processing and Chinese Computing (NLPCC), 2020
Qiaolin Xia
Haoyang Huang
Nan Duan
Dongdong Zhang
Lei Ji
Zhifang Sui
Edward Cui
Taroon Bharti
Xin Liu
Ming Zhou
MLLM, VLM
211
84
0
03 Mar 2020
Unshuffling Data for Improved Generalization
IEEE International Conference on Computer Vision (ICCV), 2020
Damien Teney
Ehsan Abbasnejad
Anton Van Den Hengel
OOD
209
82
0
27 Feb 2020
What BERT Sees: Cross-Modal Transfer for Visual Question Generation
Thomas Scialom
Patrick Bordes
Paul-Alexis Dray
Jacopo Staiano
Patrick Gallinari
191
7
0
25 Feb 2020