Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

AAAI Conference on Artificial Intelligence (AAAI), 2019

16 August 2019

arXiv: 1908.06066

Papers citing "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"

50 / 518 papers shown

Language-guided Human Motion Synthesis with Atomic Actions

ACM Multimedia (ACM MM), 2023

191

19

0

18 Aug 2023

Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning

190

3

0

18 Aug 2023

Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model

ACM Multimedia (ACM MM), 2023

174

18

0

02 Aug 2023

Robust Visual Question Answering: Datasets, Methods, and Future Challenges

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Pinghui Wang

Jun Liu

333

45

0

21 Jul 2023

PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese

International Conference on Multimedia Analysis and Pattern Recognition (ICMAPR), 2023

Nghia Hieu Nguyen

Kiet Van Nguyen

208

2

0

17 Jul 2023

Breaking Down the Task: A Unit-Grained Hybrid Training Framework for Vision and Language Decision Making

214

0

0

16 Jul 2023

Fine-grained Text-Video Retrieval with Frozen Image Encoders

409

1

0

14 Jul 2023

GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

912

317

0

07 Jul 2023

All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment

ACM Multimedia (ACM MM), 2023

478

38

0

07 Jul 2023

Vision Language Transformers: A Survey

182

7

0

06 Jul 2023

Structure Guided Multi-modal Pre-trained Transformer for Knowledge Graph Reasoning

306

17

0

06 Jul 2023

S-Omninet: Structured Data Enhanced Universal Multimodal Learning Architecture

94

0

0

01 Jul 2023

Towards Open Vocabulary Learning: A Survey

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Xiangtai Li

...

Jiangning Zhang

406

218

0

28 Jun 2023

Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input

European Conference on Computer Vision (ECCV), 2023

103

6

0

25 Jun 2023

Exploring the Role of Audio in Video Captioning

Linjie Yang

Ehsan Elhamifar

Heng Wang

168

6

0

21 Jun 2023

Generation of Radiology Findings in Chest X-Ray by Leveraging Collaborative Knowledge

Procedia Computer Science (Procedia Comput. Sci.), 2023

Sanjeev Kumar Karn

Bogdan Georgescu

...

Lucian Mihai Itu

Oladimeji Farri

Dorin Comaniciu

159

9

0

18 Jun 2023

Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training

IEEE Transactions on Image Processing (IEEE TIP), 2023

Liang Wang

198

41

0

15 Jun 2023

A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks

Expert Systems with Applications (ESWA), 2023

Witold Pedrycz

244

375

0

11 Jun 2023

Object Detection with Transformers: A Review

Italian National Conference on Sensors (INS), 2023

Tahira Shehzadi

Muhammad Zeshan Afzal

418

53

0

07 Jun 2023

Table and Image Generation for Investigating Knowledge of Entities in Pre-trained Vision and Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Hidetaka Kamigaito

Katsuhiko Hayashi

Taro Watanabe

169

1

0

03 Jun 2023

ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Anahita Bhiwandiwalla

Shachar Rosenman

171

5

0

31 May 2023

Deeply Coupled Cross-Modal Prompt Learning

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Wei Tang

209

21

0

29 May 2023

Training Data Extraction From Pre-trained Language Models: A Survey

Shotaro Ishihara

281

53

0

25 May 2023

MMNet: Multi-Mask Network for Referring Image Segmentation

246

2

0

24 May 2023

UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

343

160

0

24 May 2023

BigVideo: A Large-scale Video Subtitle Translation Dataset for Multimodal Machine Translation

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

376

15

0

23 May 2023

EDIS: Entity-Driven Image Search over Multimodal Web Content

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

326

21

0

23 May 2023

Probing the Role of Positional Information in Vision-Language Models

Philipp J. Rösch

Jindrich Libovický

117

9

0

17 May 2023

OpenViVQA: Task, Dataset, and Multimodal Fusion Models for Visual Question Answering in Vietnamese

Information Fusion (Inf. Fusion), 2023

Nghia Hieu Nguyen

Kiet Van Nguyen

Ngan Luu-Thuy Nguyen

194

27

0

07 May 2023

Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations

AAAI Conference on Artificial Intelligence (AAAI), 2023

Rongsheng Zhang

...

Zeng Zhao

308

49

0

06 May 2023

ArK: Augmented Reality with Knowledge Interactive Emergent Ability

Paul N. Bennett

...

Yejin Choi

194

8

0

01 May 2023

Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining

Mingjie Li

Hang Xu

...

Xiaojun Chang

Xiaodan Liang

LM&MA MedIm AI4CE

225

17

0

26 Apr 2023

Rethinking Benchmarks for Cross-modal Image-text Retrieval

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2023

Qin Jin

276

23

0

21 Apr 2023

W-MAE: Pre-trained weather model with masked autoencoder for multi-variable weather forecasting

Chenghong Zhang

335

31

0

18 Apr 2023

Towards Robust Prompts on Vision-Language Models

Jindong Gu

253

10

0

17 Apr 2023

CAVL: Learning Contrastive and Adaptive Representations of Vision and Language

Shentong Mo

199

1

0

10 Apr 2023

Uncurated Image-Text Datasets: Shedding Light on Demographic Bias

Computer Vision and Pattern Recognition (CVPR), 2023

196

71

0

06 Apr 2023

Self-Supervised Multimodal Learning: A Survey

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Oisin Mac Aodha

Timothy M. Hospedales

319

89

0

31 Mar 2023

Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Chunpu Xu

127

5

0

27 Mar 2023

Transformers in Speech Processing: A Survey

Heriberto Cuayáhuitl

Moazzam Shoukat

448

68

0

21 Mar 2023

Global Knowledge Calibration for Fast Open-Vocabulary Segmentation

IEEE International Conference on Computer Vision (ICCV), 2023

...

Yujiu Yang

261

47

0

16 Mar 2023

Refined Vision-Language Modeling for Fine-grained Multi-modal Pre-training

137

1

0

09 Mar 2023

TQ-Net: Mixed Contrastive Representation Learning For Heterogeneous Test Questions

147

0

0

09 Mar 2023

Toward Unsupervised Realistic Visual Question Answering

IEEE International Conference on Computer Vision (ICCV), 2023

Nuno Vasconcelos

279

2

0

09 Mar 2023

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

Computer Vision and Pattern Recognition (CVPR), 2023

Paul Hongsuck Seo

Jordi Pont-Tuset

Cordelia Schmid

497

325

0

27 Feb 2023

Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model

IEEE Journal of Biomedical and Health Informatics (IEEE JBHI), 2023

185

15

0

27 Feb 2023

Test-Time Distribution Normalization for Contrastively Learned Vision-language Models

Neural Information Processing Systems (NeurIPS), 2023

Ser-Nam Lim

244

21

0

22 Feb 2023

Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

Machine Intelligence Research (MIR), 2023

Yaowei Wang

Yonghong Tian

467

272

0

20 Feb 2023

Rejecting Cognitivism: Computational Phenomenology for Deep Learning

267

4

0

16 Feb 2023

MINOTAUR: Multi-task Video Grounding From Multimodal Queries

Sainbayar Sukhbaatar

Lorenzo Torresani

214

8

0

16 Feb 2023
