arXiv: 1908.06066
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

AAAI Conference on Artificial Intelligence (AAAI), 2019
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL, VLM, MLLM

Papers citing "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"

50 / 518 papers shown
VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
Wangchunshu Zhou
Yan Zeng
Shizhe Diao
Xinsong Zhang
CoGe, VLM
308
14
0
30 May 2022
VD-PCR: Improving Visual Dialog with Pronoun Coreference Resolution
Pattern Recognition (Pattern Recogn.), 2022
Xintong Yu
Hongming Zhang
Ruixin Hong
Yangqiu Song
Changshui Zhang
181
17
0
29 May 2022
Generalizing Multimodal Pre-training into Multilingual via Language Acquisition
Liang Zhang
Anwen Hu
Qin Jin
VLM
141
6
0
29 May 2022
DisinfoMeme: A Multimodal Dataset for Detecting Meme Intentionally Spreading Out Disinformation
Jingnong Qu
Liunian Harold Li
Jieyu Zhao
Sunipa Dev
Kai-Wei Chang
121
15
0
25 May 2022
HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval
Feilong Chen
Xiuyi Chen
Jiaxin Shi
Duzhen Zhang
Jianlong Chang
Qi Tian
VLM, CLIP
226
6
0
24 May 2022
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Shruti Palaskar
Akshita Bhagia
Yonatan Bisk
Florian Metze
A. Black
Ana Marasović
255
4
0
24 May 2022
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yuan Yao
Qi-An Chen
Ao Zhang
Wei Ji
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
VLM, MLLM
256
43
0
23 May 2022
Learning to Answer Visual Questions from Web Videos
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
314
39
0
10 May 2022
Joint learning of object graph and relation graph for visual question answering
IEEE International Conference on Multimedia and Expo (ICME), 2022
Hao Li
Xu Li
Belhal Karimi
Jie Chen
Mingming Sun
GNN
141
26
0
09 May 2022
Beyond Bounding Box: Multimodal Knowledge Learning for Object Detection
Wei Feng
Xingyuan Bu
Chenchen Zhang
Xubin Li
VLM
148
5
0
09 May 2022
CCMB: A Large-scale Chinese Cross-modal Benchmark
ACM Multimedia (ACM MM), 2022
Chunyu Xie
Heng Cai
Jincheng Li
Fanjing Kong
Xiaoyu Wu
...
Xiangzheng Zhang
Dawei Leng
Baochang Zhang
Xiangyang Ji
Yafeng Deng
MLLM, VLM
273
21
0
08 May 2022
Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction
Xiang Chen
Ningyu Zhang
Lei Li
Yunzhi Yao
Shumin Deng
Chuanqi Tan
Fei Huang
Luo Si
Huajun Chen
130
46
0
07 May 2022
Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2022
Xiang Chen
Ningyu Zhang
Lei Li
Shumin Deng
Chuanqi Tan
Changliang Xu
Fei Huang
Luo Si
Huajun Chen
222
196
0
04 May 2022
PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining
Neural Information Processing Systems (NeurIPS), 2022
Yuting Gao
Jinfeng Liu
Zihan Xu
Jinchao Zhang
Ke Li
Rongrong Ji
Chunhua Shen
VLM, CLIP
403
141
0
29 Apr 2022
CapOnImage: Context-driven Dense-Captioning on Image
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yiqi Gao
Xinglin Hou
Yuanmeng Zhang
Bo Xiao
Yuning Jiang
Peifeng Wang
189
13
0
27 Apr 2022
Contrastive Language-Action Pre-training for Temporal Localization
Mengmeng Xu
Erhan Gundogdu
Maksim
Guohao Li
M. Donoser
Loris Bazzani
189
25
0
26 Apr 2022
Progressive Learning for Image Retrieval with Hybrid-Modality Queries
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2022
Yida Zhao
Yuqing Song
Qin Jin
188
40
0
24 Apr 2022
Training and challenging models for text-guided fashion image retrieval
Eric Dodds
Jack Culpepper
Gaurav Srivastava
145
10
0
23 Apr 2022
Unified Pretraining Framework for Document Understanding
Neural Information Processing Systems (NeurIPS), 2022
Jiuxiang Gu
Jason Kuen
Vlad I. Morariu
Handong Zhao
Nikolaos Barmpalios
R. Jain
A. Nenkova
Tong Sun
272
111
0
22 Apr 2022
A Multi-level Alignment Training Scheme for Video-and-Language Grounding
Yubo Zhang
Feiyang Niu
Q. Ping
Govind Thattai
CVBM
216
2
0
22 Apr 2022
Making the Most of Text Semantics to Improve Biomedical Vision--Language Processing
European Conference on Computer Vision (ECCV), 2022
Benedikt Boecking
Naoto Usuyama
Shruthi Bannur
Daniel Coelho De Castro
Anton Schwaighofer
...
Tristan Naumann
A. Nori
Javier Alvarez-Valle
Hoifung Poon
Ozan Oktay
486
358
0
21 Apr 2022
Imagination-Augmented Natural Language Understanding
North American Chapter of the Association for Computational Linguistics (NAACL), 2022
Yujie Lu
Wanrong Zhu
Xinze Wang
Miguel P. Eckstein
William Yang Wang
216
25
0
18 Apr 2022
End-to-end Dense Video Captioning as Sequence Generation
International Conference on Computational Linguistics (COLING), 2022
Wanrong Zhu
Bo Pang
Ashish V. Thapliyal
William Yang Wang
Radu Soricut
DiffM
216
45
0
18 Apr 2022
Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks
IEEE Transactions on Image Processing (IEEE TIP), 2022
Gen Luo
Weihao Ye
Xiaoshuai Sun
Yan Wang
Liujuan Cao
Yongjian Wu
Feiyue Huang
Rongrong Ji
ViT
153
57
0
16 Apr 2022
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Computer Vision and Pattern Recognition (CVPR), 2022
Haoyu Lu
Nanyi Fei
Yuqi Huo
Yizhao Gao
Zhiwu Lu
Jiaxin Wen
CLIP, VLM
254
55
0
15 Apr 2022
Vision-and-Language Pretrained Models: A Survey
International Joint Conference on Artificial Intelligence (IJCAI), 2022
Siqu Long
Feiqi Cao
S. Han
Haiqing Yang
VLM
422
71
0
15 Apr 2022
Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog
Shunyu Zhang
X. Jiang
Zequn Yang
T. Wan
Zengchang Qin
164
14
0
10 Apr 2022
Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data
AAAI Conference on Artificial Intelligence (AAAI), 2022
Yunxing Kang
Tianqiao Liu
Hang Li
Y. Hao
Wenbiao Ding
164
9
0
10 Apr 2022
Temporal Alignment Networks for Long-term Video
Computer Vision and Pattern Recognition (CVPR), 2022
Tengda Han
Weidi Xie
Andrew Zisserman
AI4TS
169
104
0
06 Apr 2022
SimVQA: Exploring Simulated Environments for Visual Question Answering
Computer Vision and Pattern Recognition (CVPR), 2022
Paola Cascante-Bonilla
Hui Wu
Letao Wang
Rogerio Feris
Vicente Ordonez
209
9
0
31 Mar 2022
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
Computer Vision and Pattern Recognition (CVPR), 2022
Mengjun Cheng
Yipeng Sun
Long Wang
Xiongwei Zhu
Kun Yao
...
Guoli Song
Junyu Han
Jingtuo Liu
Errui Ding
Jingdong Wang
277
72
0
31 Mar 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Computer Vision and Pattern Recognition (CVPR), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
341
121
0
30 Mar 2022
Image-text Retrieval: A Survey on Recent Research and Development
International Joint Conference on Artificial Intelligence (IJCAI), 2022
Min Cao
Shiping Li
Juntao Li
Liqiang Nie
Min Zhang
336
108
0
28 Mar 2022
Large-scale Bilingual Language-Image Contrastive Learning
ByungSoo Ko
Geonmo Gu
VLM
257
17
0
28 Mar 2022
Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably)
International Conference on Machine Learning (ICML), 2022
Yu Huang
Junyang Lin
Chang Zhou
Hongxia Yang
Longbo Huang
171
144
0
23 Mar 2022
Local-Global Context Aware Transformer for Language-Guided Video Segmentation
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Chen Liang
Wenguan Wang
Tianfei Zhou
Jiaxu Miao
Yawei Luo
Yi Yang
VOS
322
100
0
18 Mar 2022
Deep Unsupervised Hashing with Latent Semantic Components
AAAI Conference on Artificial Intelligence (AAAI), 2022
Qinghong Lin
Xiaojun Chen
Qin Zhang
Shao-Qian Cai
Wenzhe Zhao
Hongfa Wang
238
3
0
17 Mar 2022
UNIMO-2: End-to-End Unified Vision-Language Grounded Learning
Findings (Findings), 2022
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
MLLM
145
24
0
17 Mar 2022
The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy
Computer Vision and Pattern Recognition (CVPR), 2022
Tianlong Chen
Zhenyu Zhang
Yu Cheng
Ahmed Hassan Awadallah
Zinan Lin
ViT
256
49
0
12 Mar 2022
LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval
Jie Lei
Xinlei Chen
Ning Zhang
Meng-xing Wang
Joey Tianyi Zhou
Tamara L. Berg
Licheng Yu
229
15
0
10 Mar 2022
Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Xiwen Liang
Fengda Zhu
Lingling Li
Hang Xu
Xiaodan Liang
LM&Ro, VLM
119
33
0
08 Mar 2022
Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting
European Conference on Computer Vision (ECCV), 2022
Chuhui Xue
Wenqing Zhang
Yu Hao
Shijian Lu
Juil Sock
Song Bai
VLM
265
46
0
08 Mar 2022
Where Does the Performance Improvement Come From? -- A Reproducibility Concern about Image-Text Retrieval
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2022
Jun Rao
Haiwei Yang
Liang Ding
Shuhan Qi
Yibing Zhan
Weifeng Liu
Dacheng Tao
OOD
236
34
0
08 Mar 2022
Find a Way Forward: a Language-Guided Semantic Map Navigator
Zehao Wang
Mingxiao Li
Minye Wu
Marie-Francine Moens
Tinne Tuytelaars
LM&Ro
144
4
0
07 Mar 2022
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
Feng Li
Hao Zhang
Yi-Fan Zhang
Shixuan Liu
Jian Guo
L. Ni
Pengchuan Zhang
Lei Zhang
AI4TS, VLM
204
41
0
03 Mar 2022
Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment
Computer Vision and Pattern Recognition (CVPR), 2022
Mingyang Zhou
Licheng Yu
Amanpreet Singh
Mengjiao MJ Wang
Zhou Yu
Ning Zhang
VLM
158
35
0
01 Mar 2022
Multi-modal Alignment using Representation Codebook
Computer Vision and Pattern Recognition (CVPR), 2022
Jiali Duan
Liqun Chen
Son Tran
Jinyu Yang
Yi Xu
Belinda Zeng
Trishul Chilimbi
486
78
0
28 Feb 2022
COMPASS: Contrastive Multimodal Pretraining for Autonomous Systems
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022
Shuang Ma
Sai H. Vemprala
Wenshan Wang
Jayesh K. Gupta
Yale Song
Daniel J. McDuff
Ashish Kapoor
SSL
188
12
0
20 Feb 2022
A Survey of Vision-Language Pre-Trained Models
International Joint Conference on Artificial Intelligence (IJCAI), 2022
Yifan Du
Zikang Liu
Junyi Li
Wayne Xin Zhao
VLM
396
241
0
18 Feb 2022
AMS_ADRN at SemEval-2022 Task 5: A Suitable Image-text Multimodal Joint Modeling Method for Multi-task Misogyny Identification
International Workshop on Semantic Evaluation (SemEval), 2022
Da Li
Ming Yi
Yukai He
141
2
0
18 Feb 2022