Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

AAAI Conference on Artificial Intelligence (AAAI), 2019
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL, VLM, MLLM

Papers citing "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"

50 / 518 papers shown
Multi-modal Machine Learning in Engineering Design: A Review and Future Directions
Journal of Computing and Information Science in Engineering (JCISE), 2023
Binyang Song
Ruilin Zhou
Faez Ahmed
AI4CE
356
64
0
14 Feb 2023
Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions
Findings (Findings), 2023
Henrik Voigt
J. Hombeck
M. Meuschke
K. Lawonn
Sina Zarrieß
VLM
229
2
0
13 Feb 2023
VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval
ACM Transactions on Knowledge Discovery from Data (TKDD), 2023
Yansong Gong
Georgina Cosma
Axel Finke
ViT
299
4
0
13 Feb 2023
Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation
AAAI Conference on Artificial Intelligence (AAAI), 2023
Bingqian Lin
Yi Zhu
Xiaodan Liang
Liang Lin
Jian-zhuo Liu
CoGe, LM&Ro
290
5
0
13 Feb 2023
Unified Vision-Language Representation Modeling for E-Commerce Same-Style Products Retrieval
The Web Conference (WWW), 2023
Ben Chen
Linbo Jin
Xinxin Wang
D. Gao
Wen Jiang
Wei Ning
311
10
0
10 Feb 2023
Learning to Agree on Vision Attention for Visual Commonsense Reasoning
IEEE Transactions on Multimedia (IEEE TMM), 2023
Zhenyang Li
Yangyang Guo
Ke-Jyun Wang
Fan Liu
Liqiang Nie
Mohan S. Kankanhalli
266
12
0
04 Feb 2023
ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency
International Conference on Learning Representations (ICLR), 2023
Pengzhen Ren
Changlin Li
Hang Xu
Yi Zhu
Guangrun Wang
Jian-zhuo Liu
Xiaojun Chang
Xiaodan Liang
209
57
0
31 Jan 2023
Effective End-to-End Vision Language Pretraining with Semantic Visual Loss
IEEE Transactions on Multimedia (IEEE TMM), 2023
Xiaofeng Yang
Fayao Liu
Guosheng Lin
VLM
97
14
0
18 Jan 2023
CLIP the Gap: A Single Domain Generalization Approach for Object Detection
Computer Vision and Pattern Recognition (CVPR), 2023
Vidit Vidit
Martin Engilberge
Mathieu Salzmann
VLM, ObjD
246
136
0
13 Jan 2023
See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning
Zhenfang Chen
Qinhong Zhou
Songlin Yang
Yining Hong
Hao Zhang
Chuang Gan
LRM, VLM
269
54
0
12 Jan 2023
Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
European Conference on Information Retrieval (ECIR), 2023
Mariya Hendriksen
Svitlana Vakulenko
E. Kuiper
Maarten de Rijke
300
5
0
12 Jan 2023
Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering
European Conference on Information Retrieval (ECIR), 2023
Paul Lerner
O. Ferret
C. Guinaudeau
234
12
0
11 Jan 2023
Universal Multimodal Representation for Language Understanding
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Zhuosheng Zhang
Kehai Chen
Rui Wang
Masao Utiyama
Eiichiro Sumita
Z. Li
Hai Zhao
SSL
270
30
0
09 Jan 2023
Text2Poster: Laying out Stylized Texts on Retrieved Images
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022
Chuhao Jin
Hongteng Xu
Ruihua Song
Zhiwu Lu
DiffM
135
11
0
06 Jan 2023
Test of Time: Instilling Video-Language Models with a Sense of Time
Computer Vision and Pattern Recognition (CVPR), 2023
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
460
47
0
05 Jan 2023
GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods
Computer Vision and Pattern Recognition (CVPR), 2023
Da Yin
Feng Gao
Govind Thattai
Michael F. Johnston
Kai-Wei Chang
VLM
175
20
0
05 Jan 2023
BagFormer: Better Cross-Modal Retrieval via bag-wise interaction
Haowen Hou
Xiaopeng Yan
Yigeng Zhang
Fengzong Lian
Zhanhui Kang
BDL
127
2
0
29 Dec 2022
On Transforming Reinforcement Learning by Transformer: The Development Trajectory
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Shengchao Hu
Li Shen
Ya Zhang
Yixin Chen
Dacheng Tao
OffRL
340
61
0
29 Dec 2022
Position-guided Text Prompt for Vision-Language Pre-training
Computer Vision and Pattern Recognition (CVPR), 2022
Alex Jinpeng Wang
Pan Zhou
Mike Zheng Shou
Shuicheng Yan
VLM
170
46
0
19 Dec 2022
MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Letitia Parcalabescu
Anette Frank
218
49
0
15 Dec 2022
NLIP: Noise-robust Language-Image Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2022
Runhu Huang
Yanxin Long
Jianhua Han
Hang Xu
Xiwen Liang
Chunjing Xu
Xiaodan Liang
VLM
250
38
0
14 Dec 2022
CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection
IEEE International Conference on Image Processing (ICIP), 2022
Kevin Hyekang Joo
Khoa T. Vo
Kashu Yamazaki
Ngan Le
229
93
0
09 Dec 2022
Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval
Computer Vision and Image Understanding (CVIU), 2022
Mustafa Shukor
Nicolas Thome
Matthieu Cord
CLIP, CoGe
270
15
0
08 Dec 2022
CoupAlign: Coupling Word-Pixel with Sentence-Mask Alignments for Referring Image Segmentation
Neural Information Processing Systems (NeurIPS), 2022
Zicheng Zhang
Yi Zhu
Jian-zhuo Liu
Xiaodan Liang
Wei Ke
219
36
0
04 Dec 2022
Protein Language Models and Structure Prediction: Connection and Progression
Bozhen Hu
Jun Xia
Jiangbin Zheng
Cheng Tan
Yufei Huang
Yongjie Xu
Stan Z. Li
210
45
0
30 Nov 2022
Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles
Computer Vision and Pattern Recognition (CVPR), 2022
Shuquan Ye
Yujia Xie
Dongdong Chen
Yichong Xu
Lu Yuan
Chenguang Zhu
Jing Liao
VLM
135
18
0
29 Nov 2022
Unified Multimodal Model with Unlikelihood Training for Visual Dialog
ACM Multimedia (ACM MM), 2022
Zihao Wang
Junli Wang
Changjun Jiang
MLLM
180
13
0
23 Nov 2022
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Zineng Tang
Jaemin Cho
Jie Lei
Joey Tianyi Zhou
VLM
178
10
0
21 Nov 2022
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
Zhihang Zhong
Mingxi Cheng
Zhirong Wu
Yuhui Yuan
Yinqiang Zheng
Ji Li
Han Hu
Stephen Lin
Yoichi Sato
Imari Sato
VLM, CLIP
131
8
0
21 Nov 2022
You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model
Computer Vision and Pattern Recognition (CVPR), 2022
Sheng Tang
Yaqing Wang
Zhenglun Kong
Tianchi Zhang
Yao Li
Caiwen Ding
Yanzhi Wang
Yi Liang
Dongkuan Xu
209
49
0
21 Nov 2022
Detect Only What You Specify: Object Detection with Linguistic Target
Moyuru Yamada
ObjD, VLM
94
0
0
18 Nov 2022
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
Computer Vision and Pattern Recognition (CVPR), 2022
Weijie Su
Xizhou Zhu
Chenxin Tao
Lewei Lu
Bin Li
Gao Huang
Yu Qiao
Xiaogang Wang
Jie Zhou
Jifeng Dai
241
55
0
17 Nov 2022
CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge
The Web Conference (WWW), 2022
Linli Yao
Wei Chen
Qin Jin
VLM
318
11
0
17 Nov 2022
Grafting Pre-trained Models for Multimodal Headline Generation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Lingfeng Qiao
Chen Wu
Ye Liu
Haoyuan Peng
Di Yin
Bo Ren
238
6
0
14 Nov 2022
CLOP: Video-and-Language Pre-Training with Knowledge Regularizations
ACM Multimedia (ACM MM), 2022
Guohao Li
Hu Yang
Feng He
Zhifan Feng
Yajuan Lyu
Hua Wu
Haifeng Wang
VLM
171
2
0
07 Nov 2022
Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection
IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2022
Yanxin Long
Jianhua Han
Runhu Huang
Xu Hang
Yi Zhu
Chunjing Xu
Xiaodan Liang
VLM, ObjD
226
29
0
02 Nov 2022
Multilingual Multimodality: A Taxonomical Survey of Datasets, Techniques, Challenges and Opportunities
Khyathi Chandu
A. Geramifard
209
3
0
30 Oct 2022
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
ACM Transactions on Knowledge Discovery from Data (TKDD), 2021
Fenglin Liu
Xian Wu
Shen Ge
Xuancheng Ren
Wei Fan
Xu Sun
Yuexian Zou
VLM
195
13
0
28 Oct 2022
Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models
British Machine Vision Conference (BMVC), 2022
Chaofan Ma
Yu-Hao Yang
Yanfeng Wang
Ya Zhang
Weidi Xie
VLM
159
55
0
27 Oct 2022
Masked Vision-Language Transformer in Fashion
Machine Intelligence Research (MIR), 2022
Ge-Peng Ji
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Daniel Gehrig
Luc Van Gool
234
27
0
27 Oct 2022
End-to-End Multimodal Representation Learning for Video Dialog
Huda AlAmri
Anthony Bilic
Michael Hu
Apoorva Beedu
Irfan Essa
202
7
0
26 Oct 2022
Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Tong Wang
Jorma T. Laaksonen
T. Langer
Heikki Arponen
Tom E. Bishop
VLM
146
6
0
24 Oct 2022
Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yuechen Wang
Wen-gang Zhou
Houqiang Li
AI4TS
145
14
0
21 Oct 2022
VTC: Improving Video-Text Retrieval with User Comments
European Conference on Computer Vision (ECCV), 2022
Laura Hanu
James Thewlis
Yuki M. Asano
Christian Rupprecht
VGen
231
8
0
19 Oct 2022
LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Hongcheng Guo
Jiaheng Liu
Haoyang Huang
Jian Yang
Zhoujun Li
Dongdong Zhang
Zheng Cui
Furu Wei
181
24
0
19 Oct 2022
Contrastive Language-Image Pre-Training with Knowledge Graphs
Neural Information Processing Systems (NeurIPS), 2022
Xuran Pan
Tianzhu Ye
Dongchen Han
Qing Xiao
Gao Huang
VLM, CLIP
191
62
0
17 Oct 2022
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Tiannan Wang
Wangchunshu Zhou
Yan Zeng
Xinsong Zhang
VLM
204
63
0
14 Oct 2022
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2022
Wenliang Dai
Zihan Liu
Ziwei Ji
Jane Polak Scowcroft
Pascale Fung
MLLM, VLM
300
75
0
14 Oct 2022
Understanding Embodied Reference with Touch-Line Transformer
International Conference on Learning Representations (ICLR), 2022
Yongqian Li
Xiaoxue Chen
Hao Zhao
Jiangtao Gong
Guyue Zhou
Federico Rossano
Yixin Zhu
283
20
0
11 Oct 2022
Transformer-based Localization from Embodied Dialog with Large-scale Pre-training
Meera Hahn
James M. Rehg
LM&Ro
160
7
0
10 Oct 2022