1908.06066
Cited By
v1
v2
v3 (latest)
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2019
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL
VLM
MLLM
Papers citing
"Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"
50 / 518 papers shown
Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval
Pattern Recognition (Pattern Recogn.), 2022
Yan Gong
Georgina Cosma
592
14
0
10 Oct 2022
Visualize Before You Write: Imagination-Guided Open-Ended Text Generation
Findings (Findings), 2022
Wanrong Zhu
An Yan
Yujie Lu
Wenda Xu
Xinze Wang
Miguel P. Eckstein
William Yang Wang
324
38
0
07 Oct 2022
ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training
Bin Shan
Weichong Yin
Yu Sun
Hao Tian
Hua Wu
Haifeng Wang
VLM
162
21
0
30 Sep 2022
Domain-Unified Prompt Representations for Source-Free Domain Generalization
Hongjing Niu
Hanting Li
Feng Zhao
Bin Li
VLM
262
29
0
29 Sep 2022
TVLT: Textless Vision-Language Transformer
Neural Information Processing Systems (NeurIPS), 2022
Zineng Tang
Jaemin Cho
Yixin Nie
Joey Tianyi Zhou
VLM
348
36
0
28 Sep 2022
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
Neural Information Processing Systems (NeurIPS), 2022
Yang Jin
Yongzhi Li
Zehuan Yuan
Yadong Mu
246
48
0
27 Sep 2022
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
International Conference on Learning Representations (ICLR), 2022
Hongwei Xue
Yuchong Sun
Bei Liu
Jianlong Fu
Rui Song
Houqiang Li
Jiebo Luo
CLIP
VLM
440
96
0
14 Sep 2022
PreSTU: Pre-Training for Scene-Text Understanding
IEEE International Conference on Computer Vision (ICCV), 2022
Jihyung Kil
Soravit Changpinyo
Xi Chen
Hexiang Hu
Sebastian Goodman
Wei-Lun Chao
Radu Soricut
VLM
350
38
0
12 Sep 2022
Multi-Modal Experience Inspired AI Creation
ACM Multimedia (ACM MM), 2022
Qian Cao
Xu Chen
Ruihua Song
Hao Jiang
Guangyan Yang
Bo Zhao
152
4
0
02 Sep 2022
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
British Machine Vision Conference (BMVC), 2022
Mustafa Shukor
Guillaume Couairon
Matthieu Cord
VLM
CLIP
311
28
0
29 Aug 2022
Prompt Tuning with Soft Context Sharing for Vision-Language Models
Neurocomputing (Neurocomputing), 2022
Kun Ding
Ying Wang
Pengzhang Liu
Qiang Yu
Hao Zhang
Shiming Xiang
Chunhong Pan
VPVLM
VLM
280
20
0
29 Aug 2022
Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning
ACM Multimedia (ACM MM), 2022
Yabing Wang
Jianfeng Dong
Tianxiang Liang
Minsong Zhang
Rui Cai
Xun Wang
275
29
0
26 Aug 2022
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
Computer Vision and Pattern Recognition (CVPR), 2022
Xiaoyi Dong
Jianmin Bao
Yinglin Zheng
Ting Zhang
Dongdong Chen
...
Weiming Zhang
Lu Yuan
Dong Chen
Fang Wen
Nenghai Yu
CLIP
VLM
290
224
0
25 Aug 2022
Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization
Chenhao Cui
Xinnian Liang
Shuangzhi Wu
Zhoujun Li
190
6
0
24 Aug 2022
Semi-Supervised and Unsupervised Deep Visual Learning: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Yanbei Chen
Goran Frehse
Xiatian Zhu
Zeynep Akata
351
174
0
24 Aug 2022
Learning More May Not Be Better: Knowledge Transferability in Vision and Language Tasks
Journal of Imaging (JI), 2022
Tianwei Chen
Noa Garcia
Mayu Otani
Chenhui Chu
Yuta Nakashima
Hajime Nagahara
VLM
139
1
0
23 Aug 2022
Revising Image-Text Retrieval via Multi-Modal Entailment
Xu Yan
Chunhui Ai
Ziqiang Cao
Min Cao
Sujian Li
Wen-Yi Chen
Guohong Fu
277
1
0
22 Aug 2022
Semantic-Enhanced Image Clustering
AAAI Conference on Artificial Intelligence (AAAI), 2022
Shao-Qian Cai
Li-qing Qiu
Xiaojun Chen
Qin Zhang
Long Chen
VLM
190
44
0
21 Aug 2022
Open Vocabulary Multi-Label Classification with Dual-Modal Decoder on Aligned Visual-Textual Features
Shichao Xu
Yikang Li
Jenhao Hsiao
C. Ho
Zhuang Qi
314
11
0
19 Aug 2022
VLMAE: Vision-Language Masked Autoencoder
Su He
Taian Guo
Tao Dai
Ruizhi Qiao
Chen Wu
Xiujun Shu
Bohan Ren
VLM
202
11
0
19 Aug 2022
Multimodal foundation models are better simulators of the human brain
Haoyu Lu
Qiongyi Zhou
Nanyi Fei
Zhiwu Lu
Mingyu Ding
...
Changde Du
Xin Zhao
Haoran Sun
Huiguang He
J. Wen
AI4CE
183
19
0
17 Aug 2022
Understanding Attention for Vision-and-Language Tasks
International Conference on Computational Linguistics (COLING), 2022
Feiqi Cao
S. Han
Siqu Long
Changwei Xu
Josiah Poon
261
7
0
17 Aug 2022
GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training
European Conference on Computer Vision (ECCV), 2022
Jaeseok Byun
Taebaek Hwang
Jianlong Fu
Taesup Moon
VLM
212
13
0
08 Aug 2022
Prompt Tuning for Generative Multimodal Pretrained Models
Han Yang
Junyang Lin
An Yang
Peng Wang
Chang Zhou
Hongxia Yang
VLM
LRM
VPVLM
183
37
0
04 Aug 2022
Masked Vision and Language Modeling for Multi-modal Representation Learning
International Conference on Learning Representations (ICLR), 2022
Gukyeong Kwon
Zhaowei Cai
Avinash Ravichandran
Erhan Bas
Rahul Bhotika
Stefano Soatto
257
84
0
03 Aug 2022
Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics
International Conference on Pattern Recognition (ICPR), 2022
Xiaoyuan Guo
Jiali Duan
C.-C. Jay Kuo
J. Gichoya
Imon Banerjee
VLM
186
1
0
31 Jul 2022
ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval
International Conference on Content-Based Multimedia Indexing (CBMI), 2022
Nicola Messina
Matteo Stefanini
Marcella Cornia
Lorenzo Baraldi
Fabrizio Falchi
Giuseppe Amato
Rita Cucchiara
VLM
132
26
0
29 Jul 2022
Paired Cross-Modal Data Augmentation for Fine-Grained Image-to-Text Retrieval
ACM Multimedia (ACM MM), 2022
Hao Wang
Guosheng Lin
Steven C. H. Hoi
165
17
0
29 Jul 2022
Temporal and cross-modal attention for audio-visual zero-shot learning
European Conference on Computer Vision (ECCV), 2022
Otniel-Bogdan Mercea
Thomas Hummel
A. Sophia Koepke
Zeynep Akata
205
32
0
20 Jul 2022
Explicit Image Caption Editing
European Conference on Computer Vision (ECCV), 2022
Zhen Wang
Long Chen
Wenbo Ma
G. Han
Yulei Niu
Jian Shao
Jun Xiao
191
14
0
20 Jul 2022
Unifying Event Detection and Captioning as Sequence Generation via Pre-Training
European Conference on Computer Vision (ECCV), 2022
Qi Zhang
Yuqing Song
Qin Jin
179
32
0
18 Jul 2022
TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
European Conference on Computer Vision (ECCV), 2022
Yuqi Liu
Pengfei Xiong
Luhui Xu
Shengming Cao
Qin Jin
267
171
0
16 Jul 2022
Learning Granularity-Unified Representations for Text-to-Image Person Re-identification
ACM Multimedia (ACM MM), 2022
Zhiyin Shao
Xinyu Zhang
Meng Fang
Zhi-hao Lin
Jian Wang
Changxing Ding
254
149
0
16 Jul 2022
Learning to translate by learning to communicate
C.M. Downey
Xuhui Zhou
Leo Z. Liu
Shane Steinert-Threlkeld
197
5
0
14 Jul 2022
LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval
Jinbin Bai
Chunhui Liu
Feiyue Ni
Haofan Wang
Mengying Hu
Xiaofeng Guo
Lele Cheng
182
14
0
11 Jul 2022
Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer
AAAI Conference on Artificial Intelligence (AAAI), 2022
Su He
Taian Guo
Tao Dai
Ruizhi Qiao
Bo Ren
Shutao Xia
VLM
291
66
0
05 Jul 2022
Vision-and-Language Pretraining
Thong Nguyen
Cong-Duy Nguyen
Xiaobao Wu
See-Kiong Ng
Anh Tuan Luu
VLM
CLIP
282
2
0
05 Jul 2022
Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval
Keyu Wen
Zhenshan Tan
Qingrong Cheng
Cheng Chen
X. Gu
VLM
198
1
0
02 Jul 2022
DALL-E for Detection: Language-driven Compositional Image Synthesis for Object Detection
Yunhao Ge
Lyne Tchapmi
Brian Nlong Zhao
Neel Joshi
Laurent Itti
Vibhav Vineet
DiffM
ObjD
343
24
0
20 Jun 2022
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
International Conference on Machine Learning (ICML), 2022
Teng Wang
Wenhao Jiang
Zhichao Lu
Feng Zheng
Ran Cheng
Chengguo Yin
Ping Luo
VLM
209
54
0
17 Jun 2022
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning
AAAI Conference on Artificial Intelligence (AAAI), 2022
Xiao Xu
Chenfei Wu
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
254
93
0
17 Jun 2022
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Neural Information Processing Systems (NeurIPS), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
485
277
0
16 Jun 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Neural Information Processing Systems (NeurIPS), 2022
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM
ObjD
296
152
0
15 Jun 2022
LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks
Neural Information Processing Systems (NeurIPS), 2022
Tuan Dinh
Yuchen Zeng
Ruisu Zhang
Ziqian Lin
Michael Gira
Shashank Rajput
Jy-yong Sohn
Dimitris Papailiopoulos
Kangwook Lee
LMTD
576
172
0
14 Jun 2022
Multimodal Learning with Transformers: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Peng Xu
Xiatian Zhu
David Clifton
ViT
571
846
0
13 Jun 2022
GLIPv2: Unifying Localization and Vision-Language Understanding
Haotian Zhang
Pengchuan Zhang
Xiaowei Hu
Yen-Chun Chen
Liunian Harold Li
Xiyang Dai
Lijuan Wang
Lu Yuan
Lei Li
Jianfeng Gao
ObjD
VLM
296
354
0
12 Jun 2022
A Unified Continuous Learning Framework for Multi-modal Knowledge Discovery and Pre-training
Zhihao Fan
Zhongyu Wei
Jingjing Chen
Siyuan Wang
Zejun Li
Jiarong Xu
Xuanjing Huang
CLL
155
6
0
11 Jun 2022
cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation
International Conference on Pattern Recognition (ICPR), 2022
Kshitij Gupta
Devansh Gautam
R. Mamidi
VLM
308
4
0
07 Jun 2022
ContraCLIP: Interpretable GAN generation driven by pairs of contrasting sentences
Christos Tzelepis
James Oldfield
Georgios Tzimiropoulos
Ioannis Patras
147
16
0
05 Jun 2022
ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts
Computer Vision and Pattern Recognition (CVPR), 2022
Bingqian Lin
Yi Zhu
Zicong Chen
Xiwen Liang
Jian-zhuo Liu
Xiaodan Liang
LM&Ro
210
61
0
31 May 2022