1908.06066
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2019
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL
VLM
MLLM
Papers citing
"Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"
50 / 518 papers shown
Multi-modal Machine Learning in Engineering Design: A Review and Future Directions
Journal of Computing and Information Science in Engineering (JCISE), 2023
Binyang Song
Ruilin Zhou
Faez Ahmed
AI4CE
356
64
0
14 Feb 2023
Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions
Findings of the Association for Computational Linguistics: EACL (Findings), 2023
Henrik Voigt
J. Hombeck
M. Meuschke
K. Lawonn
Sina Zarrieß
VLM
229
2
0
13 Feb 2023
VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval
ACM Transactions on Knowledge Discovery from Data (TKDD), 2023
Yansong Gong
Georgina Cosma
Axel Finke
ViT
299
4
0
13 Feb 2023
Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation
AAAI Conference on Artificial Intelligence (AAAI), 2023
Bingqian Lin
Yi Zhu
Xiaodan Liang
Liang Lin
Jian-zhuo Liu
CoGe
LM&Ro
290
5
0
13 Feb 2023
Unified Vision-Language Representation Modeling for E-Commerce Same-Style Products Retrieval
The Web Conference (WWW), 2023
Ben Chen
Linbo Jin
Xinxin Wang
D. Gao
Wen Jiang
Wei Ning
311
10
0
10 Feb 2023
Learning to Agree on Vision Attention for Visual Commonsense Reasoning
IEEE Transactions on Multimedia (IEEE TMM), 2023
Zhenyang Li
Yangyang Guo
Ke-Jyun Wang
Fan Liu
Liqiang Nie
Mohan S. Kankanhalli
266
12
0
04 Feb 2023
ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency
International Conference on Learning Representations (ICLR), 2023
Pengzhen Ren
Changlin Li
Hang Xu
Yi Zhu
Guangrun Wang
Jian-zhuo Liu
Xiaojun Chang
Xiaodan Liang
209
57
0
31 Jan 2023
Effective End-to-End Vision Language Pretraining with Semantic Visual Loss
IEEE Transactions on Multimedia (IEEE TMM), 2023
Xiaofeng Yang
Fayao Liu
Guosheng Lin
VLM
97
14
0
18 Jan 2023
CLIP the Gap: A Single Domain Generalization Approach for Object Detection
Computer Vision and Pattern Recognition (CVPR), 2023
Vidit Vidit
Martin Engilberge
Mathieu Salzmann
VLM
ObjD
246
136
0
13 Jan 2023
See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning
Zhenfang Chen
Qinhong Zhou
Songlin Yang
Yining Hong
Hao Zhang
Chuang Gan
LRM
VLM
269
54
0
12 Jan 2023
Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
European Conference on Information Retrieval (ECIR), 2023
Mariya Hendriksen
Svitlana Vakulenko
E. Kuiper
Maarten de Rijke
300
5
0
12 Jan 2023
Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering
European Conference on Information Retrieval (ECIR), 2023
Paul Lerner
O. Ferret
C. Guinaudeau
234
12
0
11 Jan 2023
Universal Multimodal Representation for Language Understanding
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Zhuosheng Zhang
Kehai Chen
Rui Wang
Masao Utiyama
Eiichiro Sumita
Z. Li
Hai Zhao
SSL
270
30
0
09 Jan 2023
Text2Poster: Laying out Stylized Texts on Retrieved Images
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022
Chuhao Jin
Hongteng Xu
Ruihua Song
Zhiwu Lu
DiffM
135
11
0
06 Jan 2023
Test of Time: Instilling Video-Language Models with a Sense of Time
Computer Vision and Pattern Recognition (CVPR), 2023
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
460
47
0
05 Jan 2023
GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods
Computer Vision and Pattern Recognition (CVPR), 2023
Da Yin
Feng Gao
Govind Thattai
Michael F. Johnston
Kai-Wei Chang
VLM
175
20
0
05 Jan 2023
BagFormer: Better Cross-Modal Retrieval via bag-wise interaction
Haowen Hou
Xiaopeng Yan
Yigeng Zhang
Fengzong Lian
Zhanhui Kang
BDL
127
2
0
29 Dec 2022
On Transforming Reinforcement Learning by Transformer: The Development Trajectory
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Shengchao Hu
Li Shen
Ya Zhang
Yixin Chen
Dacheng Tao
OffRL
340
61
0
29 Dec 2022
Position-guided Text Prompt for Vision-Language Pre-training
Computer Vision and Pattern Recognition (CVPR), 2022
Alex Jinpeng Wang
Pan Zhou
Mike Zheng Shou
Shuicheng Yan
VLM
170
46
0
19 Dec 2022
MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Letitia Parcalabescu
Anette Frank
218
49
0
15 Dec 2022
NLIP: Noise-robust Language-Image Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2022
Runhu Huang
Yanxin Long
Jianhua Han
Hang Xu
Xiwen Liang
Chunjing Xu
Xiaodan Liang
VLM
250
38
0
14 Dec 2022
CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection
IEEE International Conference on Image Processing (ICIP), 2022
Kevin Hyekang Joo
Khoa T. Vo
Kashu Yamazaki
Ngan Le
229
93
0
09 Dec 2022
Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval
Computer Vision and Image Understanding (CVIU), 2022
Mustafa Shukor
Nicolas Thome
Matthieu Cord
CLIP
CoGe
270
15
0
08 Dec 2022
CoupAlign: Coupling Word-Pixel with Sentence-Mask Alignments for Referring Image Segmentation
Neural Information Processing Systems (NeurIPS), 2022
Zicheng Zhang
Yi Zhu
Jian-zhuo Liu
Xiaodan Liang
Wei Ke
219
36
0
04 Dec 2022
Protein Language Models and Structure Prediction: Connection and Progression
Bozhen Hu
Jun Xia
Jiangbin Zheng
Cheng Tan
Yufei Huang
Yongjie Xu
Stan Z. Li
210
45
0
30 Nov 2022
Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles
Computer Vision and Pattern Recognition (CVPR), 2022
Shuquan Ye
Yujia Xie
Dongdong Chen
Yichong Xu
Lu Yuan
Chenguang Zhu
Jing Liao
VLM
135
18
0
29 Nov 2022
Unified Multimodal Model with Unlikelihood Training for Visual Dialog
ACM Multimedia (ACM MM), 2022
Zihao Wang
Junli Wang
Changjun Jiang
MLLM
180
13
0
23 Nov 2022
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Zineng Tang
Jaemin Cho
Jie Lei
Joey Tianyi Zhou
VLM
178
10
0
21 Nov 2022
ClipCrop: Conditioned Cropping Driven by Vision-Language Model
Zhihang Zhong
Mingxi Cheng
Zhirong Wu
Yuhui Yuan
Yinqiang Zheng
Ji Li
Han Hu
Stephen Lin
Yoichi Sato
Imari Sato
VLM
CLIP
131
8
0
21 Nov 2022
You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model
Computer Vision and Pattern Recognition (CVPR), 2022
Sheng Tang
Yaqing Wang
Zhenglun Kong
Tianchi Zhang
Yao Li
Caiwen Ding
Yanzhi Wang
Yi Liang
Dongkuan Xu
209
49
0
21 Nov 2022
Detect Only What You Specify: Object Detection with Linguistic Target
Moyuru Yamada
ObjD
VLM
94
0
0
18 Nov 2022
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
Computer Vision and Pattern Recognition (CVPR), 2022
Weijie Su
Xizhou Zhu
Chenxin Tao
Lewei Lu
Bin Li
Gao Huang
Yu Qiao
Xiaogang Wang
Jie Zhou
Jifeng Dai
241
55
0
17 Nov 2022
CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge
The Web Conference (WWW), 2022
Linli Yao
Wei Chen
Qin Jin
VLM
318
11
0
17 Nov 2022
Grafting Pre-trained Models for Multimodal Headline Generation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Lingfeng Qiao
Chen Wu
Ye Liu
Haoyuan Peng
Di Yin
Bo Ren
238
6
0
14 Nov 2022
CLOP: Video-and-Language Pre-Training with Knowledge Regularizations
ACM Multimedia (ACM MM), 2022
Guohao Li
Hu Yang
Feng He
Zhifan Feng
Yajuan Lyu
Hua Wu
Haifeng Wang
VLM
171
2
0
07 Nov 2022
Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection
IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2022
Yanxin Long
Jianhua Han
Runhu Huang
Xu Hang
Yi Zhu
Chunjing Xu
Xiaodan Liang
VLM
ObjD
226
29
0
02 Nov 2022
Multilingual Multimodality: A Taxonomical Survey of Datasets, Techniques, Challenges and Opportunities
Khyathi Chandu
A. Geramifard
209
3
0
30 Oct 2022
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
ACM Transactions on Knowledge Discovery from Data (TKDD), 2021
Fenglin Liu
Xian Wu
Shen Ge
Xuancheng Ren
Wei Fan
Xu Sun
Yuexian Zou
VLM
195
13
0
28 Oct 2022
Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models
British Machine Vision Conference (BMVC), 2022
Chaofan Ma
Yu-Hao Yang
Yanfeng Wang
Ya Zhang
Weidi Xie
VLM
159
55
0
27 Oct 2022
Masked Vision-Language Transformer in Fashion
Machine Intelligence Research (MIR), 2022
Ge-Peng Ji
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Daniel Gehrig
Luc Van Gool
234
27
0
27 Oct 2022
End-to-End Multimodal Representation Learning for Video Dialog
Huda AlAmri
Anthony Bilic
Michael Hu
Apoorva Beedu
Irfan Essa
202
7
0
26 Oct 2022
Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Tong Wang
Jorma T. Laaksonen
T. Langer
Heikki Arponen
Tom E. Bishop
VLM
146
6
0
24 Oct 2022
Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yuechen Wang
Wen-gang Zhou
Houqiang Li
AI4TS
145
14
0
21 Oct 2022
VTC: Improving Video-Text Retrieval with User Comments
European Conference on Computer Vision (ECCV), 2022
Laura Hanu
James Thewlis
Yuki M. Asano
Christian Rupprecht
VGen
231
8
0
19 Oct 2022
LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Hongcheng Guo
Jiaheng Liu
Haoyang Huang
Jian Yang
Zhoujun Li
Dongdong Zhang
Zheng Cui
Furu Wei
181
24
0
19 Oct 2022
Contrastive Language-Image Pre-Training with Knowledge Graphs
Neural Information Processing Systems (NeurIPS), 2022
Xuran Pan
Tianzhu Ye
Dongchen Han
Qing Xiao
Gao Huang
VLM
CLIP
191
62
0
17 Oct 2022
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Tiannan Wang
Wangchunshu Zhou
Yan Zeng
Xinsong Zhang
VLM
204
63
0
14 Oct 2022
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2022
Wenliang Dai
Zihan Liu
Ziwei Ji
Jane Polak Scowcroft
Pascale Fung
MLLM
VLM
300
75
0
14 Oct 2022
Understanding Embodied Reference with Touch-Line Transformer
International Conference on Learning Representations (ICLR), 2022
Yongqian Li
Xiaoxue Chen
Hao Zhao
Jiangtao Gong
Guyue Zhou
Federico Rossano
Yixin Zhu
283
20
0
11 Oct 2022
Transformer-based Localization from Embodied Dialog with Large-scale Pre-training
Meera Hahn
James M. Rehg
LM&Ro
160
7
0
10 Oct 2022