
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

AAAI Conference on Artificial Intelligence (AAAI), 2019
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL, VLM, MLLM

Papers citing "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"

50 / 518 papers shown
Language-guided Human Motion Synthesis with Atomic Actions
ACM Multimedia (ACM MM), 2023
Yuanhao Zhai
Mingzhen Huang
Tianyu Luan
Lu Dong
Ifeoma Nwogu
Siwei Lyu
David Doermann
Junsong Yuan
191
19
0
18 Aug 2023
Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning
Ye-Ting Chen
Siyu Zhang
Yaoru Sun
Weijian Liang
Haoran Wang
190
3
0
18 Aug 2023
Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model
ACM Multimedia (ACM MM), 2023
Ka Leong Cheng
Wenpo Song
Zheng Ma
Wenhao Zhu
Zi-Yue Zhu
Jianbing Zhang
CLIP, VLM
174
18
0
02 Aug 2023
Robust Visual Question Answering: Datasets, Methods, and Future Challenges
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Jie Ma
Pinghui Wang
Dechen Kong
Zewei Wang
Jun Liu
Hongbin Pei
Junzhou Zhao
OOD
333
45
0
21 Jul 2023
PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese
International Conference on Multimedia Analysis and Pattern Recognition (ICMAPR), 2023
Nghia Hieu Nguyen
Kiet Van Nguyen
208
2
0
17 Jul 2023
Breaking Down the Task: A Unit-Grained Hybrid Training Framework for Vision and Language Decision Making
Ruipu Luo
Jiwen Zhang
Zhongyu Wei
VLM
214
0
0
16 Jul 2023
Fine-grained Text-Video Retrieval with Frozen Image Encoders
Zuozhuo Dai
Fang Shao
Qingkun Su
Zilong Dong
Siyu Zhu
409
1
0
14 Jul 2023
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Shilong Zhang
Pei Sun
Shoufa Chen
Min Xiao
Wenqi Shao
Wenwei Zhang
Yu Liu
Kai-xiang Chen
Ping Luo
MLLM, VLM
912
317
0
07 Jul 2023
All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment
ACM Multimedia (ACM MM), 2023
Chunhui Zhang
Xin Sun
Li Liu
Yiqian Yang
Qiong Liu
Xiaoping Zhou
Yanfeng Wang
478
38
0
07 Jul 2023
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
182
7
0
06 Jul 2023
Structure Guided Multi-modal Pre-trained Transformer for Knowledge Graph Reasoning
K. Liang
Sihang Zhou
Yue Liu
Lingyuan Meng
Meng Liu
Xinwang Liu
306
17
0
06 Jul 2023
S-Omninet: Structured Data Enhanced Universal Multimodal Learning Architecture
Ye Xue
Diego Klabjan
J. Utke
94
0
0
01 Jul 2023
Towards Open Vocabulary Learning: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Jianzong Wu
Xiangtai Li
Shilin Xu
Haobo Yuan
Henghui Ding
...
Jiangning Zhang
Yu Tong
Xudong Jiang
Guohao Li
Dacheng Tao
ObjD, VLM
406
218
0
28 Jun 2023
Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input
European Conference on Computer Vision (ECCV), 2023
Qingpei Guo
Kaisheng Yao
Wei Chu
MLLM
103
6
0
25 Jun 2023
Exploring the Role of Audio in Video Captioning
Yuhan Shen
Linjie Yang
Longyin Wen
Haichao Yu
Ehsan Elhamifar
Heng Wang
168
6
0
21 Jun 2023
Generation of Radiology Findings in Chest X-Ray by Leveraging Collaborative Knowledge
Procedia Computer Science (Procedia Comput. Sci.), 2023
Manuela Danu
George Marica
Sanjeev Kumar Karn
Bogdan Georgescu
Awais Mansoor
...
Lucian Mihai Itu
C. Suciu
Sasa Grbic
Oladimeji Farri
Dorin Comaniciu
MedIm
159
9
0
18 Jun 2023
Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training
IEEE Transactions on Image Processing (IEEE TIP), 2023
Chong Liu
Yuqi Zhang
Hongsong Wang
Weihua Chen
F. Wang
Yan Huang
Yixing Shen
Liang Wang
198
41
0
15 Jun 2023
A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks
Expert Systems with Applications (ESWA), 2023
Saidul Islam
Hanae Elmekki
Ahmed Elsebai
Jamal Bentahar
Najat Drawel
Gaith Rjoub
Witold Pedrycz
ViT, MedIm
244
375
0
11 Jun 2023
Object Detection with Transformers: A Review
Italian National Conference on Sensors (INS), 2023
Tahira Shehzadi
K. Hashmi
D. Stricker
Muhammad Zeshan Afzal
ViT, MU
418
53
0
07 Jun 2023
Table and Image Generation for Investigating Knowledge of Entities in Pre-trained Vision and Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Hidetaka Kamigaito
Katsuhiko Hayashi
Taro Watanabe
VLM
169
1
0
03 Jun 2023
ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Xiao Xu
Bei Li
Chenfei Wu
Shao-Yen Tseng
Anahita Bhiwandiwalla
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
AIFin, VLM
171
5
0
31 May 2023
Deeply Coupled Cross-Modal Prompt Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Xuejing Liu
Wei Tang
Jinghui Lu
Rui Zhao
Zhaojun Guo
Fei Tan
VLM
209
21
0
29 May 2023
Training Data Extraction From Pre-trained Language Models: A Survey
Shotaro Ishihara
281
53
0
25 May 2023
MMNet: Multi-Mask Network for Referring Image Segmentation
Yimin Yan
Xingjian He
Wenxuan Wan
Qingbin Liu
EgoV
246
2
0
24 May 2023
UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Ahmed Masry
P. Kavehzadeh
Do Xuan Long
Enamul Hoque
Shafiq Joty
LRM
343
160
0
24 May 2023
BigVideo: A Large-scale Video Subtitle Translation Dataset for Multimodal Machine Translation
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Liyan Kang
Luyang Huang
Ningxin Peng
Peihao Zhu
Zewei Sun
Shanbo Cheng
Mingxuan Wang
Degen Huang
Jinsong Su
376
15
0
23 May 2023
EDIS: Entity-Driven Image Search over Multimodal Web Content
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Siqi Liu
Weixi Feng
Tsu-Jui Fu
Wenhu Chen
Wenjie Wang
VLM
326
21
0
23 May 2023
Probing the Role of Positional Information in Vision-Language Models
Philipp J. Rösch
Jindrich Libovický
117
9
0
17 May 2023
OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese
Information Fusion (Inf. Fusion), 2023
Nghia Hieu Nguyen
Duong T.D. Vo
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
194
27
0
07 May 2023
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
AAAI Conference on Artificial Intelligence (AAAI), 2023
Yufen Huang
Jiji Tang
Zhuo Chen
Rongsheng Zhang
Xinfeng Zhang
...
Zeng Zhao
Zhou Zhao
Tangjie Lv
Zhipeng Hu
Wen Zhang
VLM
308
49
0
06 May 2023
ArK: Augmented Reality with Knowledge Interactive Emergent Ability
Qiuyuan Huang
Jinho Park
Abhinav Gupta
Paul N. Bennett
Ran Gong
...
Baolin Peng
O. Mohammed
C. Pal
Yejin Choi
Jianfeng Gao
194
8
0
01 May 2023
Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining
Bingqian Lin
Zicong Chen
Mingjie Li
Haokun Lin
Hang Xu
...
Ling-Hao Chen
Xiaojun Chang
Yi Yang
L. Xing
Xiaodan Liang
LM&MA, MedIm, AI4CE
225
17
0
26 Apr 2023
Rethinking Benchmarks for Cross-modal Image-text Retrieval
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2023
Wei Chen
Linli Yao
Qin Jin
VLM
276
23
0
21 Apr 2023
W-MAE: Pre-trained weather model with masked autoencoder for multi-variable weather forecasting
Xin Man
Chenghong Zhang
Jin Feng
Changyu Li
Jie Shao
AI4Cl
335
31
0
18 Apr 2023
Towards Robust Prompts on Vision-Language Models
Jindong Gu
Ahmad Beirami
Xuezhi Wang
Alex Beutel
Juil Sock
Yao Qin
VLM, VPVLM
253
10
0
17 Apr 2023
CAVL: Learning Contrastive and Adaptive Representations of Vision and Language
Shentong Mo
Jingfei Xia
Ihor Markevych
CLIP, VLM
199
1
0
10 Apr 2023
Uncurated Image-Text Datasets: Shedding Light on Demographic Bias
Computer Vision and Pattern Recognition (CVPR), 2023
Noa Garcia
Yusuke Hirota
Yankun Wu
Yuta Nakashima
EGVM
196
71
0
06 Apr 2023
Self-Supervised Multimodal Learning: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
319
89
0
31 Mar 2023
Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Chunpu Xu
Jing Li
VLM
127
5
0
27 Mar 2023
Transformers in Speech Processing: A Survey
S. Latif
Aun Zaidi
Heriberto Cuayáhuitl
Fahad Shamshad
Moazzam Shoukat
Muhammad Usama
Junaid Qadir
448
68
0
21 Mar 2023
Global Knowledge Calibration for Fast Open-Vocabulary Segmentation
IEEE International Conference on Computer Vision (ICCV), 2023
Kunyang Han
Yong-Jin Liu
Jun Hao Liew
Henghui Ding
Yunchao Wei
...
Yitong Wang
Yansong Tang
Yujiu Yang
Jiashi Feng
Yao-Min Zhao
VLM
261
47
0
16 Mar 2023
Refined Vision-Language Modeling for Fine-grained Multi-modal Pre-training
Lisai Zhang
Qingcai Chen
Zhijian Chen
Yunpeng Han
Zhonghua Li
Bo Zhao
VLM
137
1
0
09 Mar 2023
TQ-Net: Mixed Contrastive Representation Learning For Heterogeneous Test Questions
He Zhu
Xihua Li
Xuemin Zhao
Yunbo Cao
Shan Yu
147
0
0
09 Mar 2023
Toward Unsupervised Realistic Visual Question Answering
IEEE International Conference on Computer Vision (ICCV), 2023
Yuwei Zhang
Chih-Hui Ho
Nuno Vasconcelos
CoGe
279
2
0
09 Mar 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Computer Vision and Pattern Recognition (CVPR), 2023
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS, VLM
497
325
0
27 Feb 2023
Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model
IEEE Journal of Biomedical and Health Informatics (IEEE JBHI), 2023
Jaeyoung Huh
Sangjoon Park
Jeonghyeon Lee
Jong Chul Ye
LM&MA
185
15
0
27 Feb 2023
Test-Time Distribution Normalization for Contrastively Learned Vision-language Models
Neural Information Processing Systems (NeurIPS), 2023
Yi Zhou
Juntao Ren
Fengyu Li
Ramin Zabih
Ser-Nam Lim
VLM
244
21
0
22 Feb 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Machine Intelligence Research (MIR), 2023
Tianlin Li
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CE, VLM
467
272
0
20 Feb 2023
Rejecting Cognitivism: Computational Phenomenology for Deep Learning
P. Beckmann
G. Köstner
Ines Hipólito
267
4
0
16 Feb 2023
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Raghav Goyal
E. Mavroudi
Xitong Yang
Sainbayar Sukhbaatar
Leonid Sigal
Matt Feiszli
Lorenzo Torresani
Du Tran
214
8
0
16 Feb 2023