Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

AAAI Conference on Artificial Intelligence (AAAI), 2019
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL, VLM, MLLM

Papers citing "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"

50 / 518 papers shown
ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis
Advik Sinha
Saurabh Atreya
Aashutosh A V
Sk Aziz Ali
Abhijit Das
CLIP
120
0
0
25 Nov 2025
Masked Diffusion Captioning for Visual Feature Learning
Chao Feng
Zihao Wei
Andrew Owens
DiffM
227
0
0
30 Oct 2025
Vision-Centric Activation and Coordination for Multimodal Large Language Models
Yunnan Wang
Fan Lu
Kecheng Zheng
Ziyuan Huang
Ziqiang Li
Wenjun Zeng
Xin Jin
MLLM
328
0
0
16 Oct 2025
T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation
Yubin Chen
Xuyang Guo
Zhenmei Shi
Zhao Song
Jiahao Zhang
VGen
619
8
0
24 Jul 2025
Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation
Yong-Jin Liu
SongLi Wu
Sule Bai
Jiahao Wang
Yitong Wang
Yansong Tang
VLM, VOS
286
2
0
19 Jun 2025
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Xiao Xu
L. Qin
Wanxiang Che
Min-Yen Kan
MoE, VLM
300
0
0
13 Jun 2025
Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations
Yibo Cui
Liang Xie
Yu Zhao
Jiawei Sun
Erwei Yin
155
2
0
10 Jun 2025
Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI
Hugo Georgenthum
Cristian Cosentino
Fabrizio Marozzo
Pietro Liò
MedIm
882
1
0
28 Apr 2025
A Survey of Task-Oriented Knowledge Graph Reasoning: Status, Applications, and Prospects
Guanglin Niu
Bo Li
Yangguang Lin
LRM
260
1
0
27 Apr 2025
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
Cheng-Yu Hsieh
Pavan Kumar Anasosalu Vasu
Fartash Faghri
Raviteja Vemulapalli
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Hadi Pouransari
VLM
909
0
0
11 Apr 2025
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Ziming Wei
Bingqian Lin
Yunshuang Nie
Jiaqi Chen
Shikui Ma
Hang Xu
Xiaodan Liang
440
3
0
23 Mar 2025
Optimal Transport for Brain-Image Alignment: Unveiling Redundancy and Synergy in Neural Information Processing
Yang Xiao
Wang Lu
Jie Ji
Ruimeng Ye
Gen Li
Xiaolong Ma
Bo Hui
OT
286
0
0
09 Mar 2025
Composed Multi-modal Retrieval: A Survey of Approaches and Applications
Kun Zhang
Jingyu Li
Zhiyu Li
Jingjing Zhang
F. Li
...
Nan Chen
Lei Zhang
Yongdong Zhang
Zhendong Mao
S.Kevin Zhou
389
1
0
03 Mar 2025
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Computer Vision and Pattern Recognition (CVPR), 2025
Miran Heo
Min-Hung Chen
De-An Huang
Sifei Liu
Subhashree Radhakrishnan
Seon Joo Kim
Yu-Chun Wang
Ryo Hachiuma
ObjD, VLM
504
8
0
14 Jan 2025
DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data
Yuanpeng Tu
Xi Chen
Ser-Nam Lim
Hengshuang Zhao
443
1
0
03 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Neural Information Processing Systems (NeurIPS), 2024
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM, VLM, LRM
755
117
0
03 Jan 2025
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Wenqi Zhang
Hang Zhang
Xin Li
Jiashuo Sun
Yongliang Shen
Weiming Lu
Deli Zhao
Yueting Zhuang
Lidong Bing
VLM
555
5
0
01 Jan 2025
CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation
Dengke Zhang
Fagui Liu
Quan Tang
VLM
592
2
0
15 Nov 2024
Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP
Neural Information Processing Systems (NeurIPS), 2024
Chen Huang
Skyler Seto
Samira Abnar
David Grangier
Navdeep Jaitly
J. Susskind
VLM
239
4
0
31 Oct 2024
ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering
Pacific Asia Conference on Language, Information and Computation (PACLIC), 2024
Nghia Hieu Nguyen
Tho Thanh Quan
Ngan Luu-Thuy Nguyen
207
0
0
18 Oct 2024
CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
ACM Multimedia (ACM MM), 2022
Zhiyuan Ma
Jianjun Li
Guohui Li
Kaiyan Huang
VLM
357
9
0
16 Oct 2024
Leveraging Customer Feedback for Multi-modal Insight Extraction
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Sandeep Sricharan Mukku
Abinesh Kanagarajan
Pushpendu Ghosh
Chetan Aggarwal
162
0
0
13 Oct 2024
Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity
Hanqi Jiang
Xixuan Hao
Yuzhou Huang
Chong Ma
Jiaxun Zhang
Yi Pan
Ruimao Zhang
MedIm
349
1
0
01 Oct 2024
VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery
Mohammadmahdi Honarmand
Muhammad Abdullah Jamal
Omid Mohareri
336
5
0
07 Sep 2024
A Survey on Integrated Sensing, Communication, and Computation
IEEE Communications Surveys and Tutorials (COMST), 2024
Dingzhu Wen
Yong Zhou
Xiaoyang Li
Yuanming Shi
Kaibin Huang
Khaled B. Letaief
216
109
0
15 Aug 2024
ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation
Computer Vision and Pattern Recognition (CVPR), 2024
Jingyun Wang
Guoliang Kang
VLM, SSL
422
12
0
13 Aug 2024
Efficient and Versatile Robust Fine-Tuning of Zero-shot Models
European Conference on Computer Vision (ECCV), 2024
Sungyeon Kim
Boseung Jeong
Donghyun Kim
Suha Kwak
VLM
183
8
0
11 Aug 2024
FlexAttention for Efficient High-Resolution Vision-Language Models
European Conference on Computer Vision (ECCV), 2024
Junyan Li
Delin Chen
Tianle Cai
Peihao Chen
Yining Hong
Zhenfang Chen
Yikang Shen
Chuang Gan
VLM
234
7
0
29 Jul 2024
HAPFI: History-Aware Planning based on Fused Information
Sujin Jeon
Suyeon Shin
Byoung-Tak Zhang
168
1
0
23 Jul 2024
I Know About "Up"! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction
Zaiqiao Meng
Hao Zhou
Yifang Chen
198
5
0
19 Jul 2024
Precision at Scale: Domain-Specific Datasets On-Demand
Jesús M. Rodríguez-de-Vera
Imanol G. Estepa
Ignacio Sarasúa
Bhalaji Nagarajan
Petia Radeva
233
2
0
03 Jul 2024
Cross-Modal Learning for Anomaly Detection in Fused Magnesium Smelting Process: Methodology and Benchmark
Gaochang Wu
Yapeng Zhang
Lan Deng
Jingxin Zhang
Tianyou Chai
184
1
0
13 Jun 2024
Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model
Elaheh Baharlouei
Mahsa Shafaei
Yigeng Zhang
Hugo Jair Escalante
Thamar Solorio
186
1
0
12 Jun 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Chenyu Yang
Xizhou Zhu
Jinguo Zhu
Weijie Su
Junjie Wang
...
Lewei Lu
Bin Li
Jie Zhou
Yu Qiao
Jifeng Dai
VLM, CLIP
163
8
0
11 Jun 2024
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang
Jiawei Kong
Wenbo Yu
Bin Chen
Jiawei Li
Hao Wu
Ke Xu
Ke Xu
AAML, VLM
385
27
0
08 Jun 2024
Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching
Xuri Ge
Fuhai Chen
Songpei Xu
Fuxiang Tao
Jie Wang
Joemon M. Jose
199
3
0
05 Jun 2024
Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling
Cristian Rodriguez-Opazo
Ehsan Abbasnejad
Damien Teney
Edison Marrese-Taylor
Hamed Damirchi
Anton Van Den Hengel
VLM
334
1
0
27 May 2024
ColorFoil: Investigating Color Blindness in Large Vision and Language Models
Ahnaf Mozib Samin
M. F. Ahmed
Md. Mushtaq Shahriyar Rafee
VLM
254
6
0
19 May 2024
SignAvatar: Sign Language 3D Motion Reconstruction and Generation
IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2024
Lu Dong
Lipisha Chaudhary
Fei Xu
Xiao Wang
Mason Lary
Ifeoma Nwogu
SLR
152
11
0
13 May 2024
3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting
Xuri Ge
Songpei Xu
Fuhai Chen
Jie Wang
Guoxin Wang
Shan An
Joemon M. Jose
3DPC
278
22
0
26 Apr 2024
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
Xuzheng Yu
Chen Jiang
Xingning Dong
Tian Gan
Ming Yang
Qingpei Guo
348
4
0
22 Apr 2024
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Quan Van Nguyen
Dan Quang Tran
Huy Quang Pham
Thang Kien-Bao Nguyen
Nghia Hieu Nguyen
Kiet Van Nguyen
Ngan Luu-Thuy Nguyen
CoGe
554
8
0
16 Apr 2024
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
Jintao Sun
Zhedong Zheng
Gangyi Ding
Gangyi Ding
392
18
0
16 Apr 2024
Transferable and Principled Efficiency for Open-Vocabulary Segmentation
Jingxuan Xu
Wuyang Chen
Yao-Min Zhao
Yunchao Wei
VLM
237
1
0
11 Apr 2024
Hyperbolic Learning with Synthetic Captions for Open-World Detection
Fanjie Kong
Yanbei Chen
Jiarui Cai
Davide Modolo
VLM, ObjD
206
14
0
07 Apr 2024
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Weifeng Lin
Xinyu Wei
Ruichuan An
Shiyang Feng
Bocheng Zou
Yulin Luo
Siyuan Huang
Shanghang Zhang
Jiaming Song
VLM
363
84
0
29 Mar 2024
UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction
Xixuan Hao
Wei Chen
Yibo Yan
Siru Zhong
Kun Wang
Qingsong Wen
Yuxuan Liang
VLM
317
1
0
25 Mar 2024
VidLA: Video-Language Alignment at Scale
Computer Vision and Pattern Recognition (CVPR), 2024
Mamshad Nayeem Rizve
Fan Fei
Jayakrishnan Unnikrishnan
Son Tran
Benjamin Z. Yao
Belinda Zeng
Mubarak Shah
Trishul Chilimbi
VLM, AI4TS
180
8
0
21 Mar 2024
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes
Ting Yu
Xiaojun Lin
Shuhui Wang
Weiguo Sheng
Qingming Huang
Jun-chen Yu
3DV
208
16
0
12 Mar 2024
Towards Deviation-Robust Agent Navigation via Perturbation-Aware Contrastive Learning
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Bingqian Lin
Yanxin Long
Yi Zhu
Fengda Zhu
Xiaodan Liang
QiXiang Ye
Liang Lin
207
7
0
09 Mar 2024