1908.06066
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2019
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL
VLM
MLLM
Papers citing
"Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"
50 / 518 papers shown
VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
Wangchunshu Zhou
Yan Zeng
Shizhe Diao
Xinsong Zhang
CoGe
VLM
308
14
0
30 May 2022
VD-PCR: Improving Visual Dialog with Pronoun Coreference Resolution
Pattern Recognition (Pattern Recogn.), 2022
Xintong Yu
Hongming Zhang
Ruixin Hong
Yangqiu Song
Changshui Zhang
181
17
0
29 May 2022
Generalizing Multimodal Pre-training into Multilingual via Language Acquisition
Liang Zhang
Anwen Hu
Qin Jin
VLM
141
6
0
29 May 2022
DisinfoMeme: A Multimodal Dataset for Detecting Meme Intentionally Spreading Out Disinformation
Jingnong Qu
Liunian Harold Li
Jieyu Zhao
Sunipa Dev
Kai-Wei Chang
121
15
0
25 May 2022
HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval
Feilong Chen
Xiuyi Chen
Jiaxin Shi
Duzhen Zhang
Jianlong Chang
Qi Tian
VLM
CLIP
226
6
0
24 May 2022
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Shruti Palaskar
Akshita Bhagia
Yonatan Bisk
Florian Metze
A. Black
Ana Marasović
255
4
0
24 May 2022
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yuan Yao
Qi-An Chen
Ao Zhang
Wei Ji
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
VLM
MLLM
256
43
0
23 May 2022
Learning to Answer Visual Questions from Web Videos
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
314
39
0
10 May 2022
Joint learning of object graph and relation graph for visual question answering
IEEE International Conference on Multimedia and Expo (ICME), 2022
Hao Li
Xu Li
Belhal Karimi
Jie Chen
Mingming Sun
GNN
141
26
0
09 May 2022
Beyond Bounding Box: Multimodal Knowledge Learning for Object Detection
Wei Feng
Xingyuan Bu
Chenchen Zhang
Xubin Li
VLM
148
5
0
09 May 2022
CCMB: A Large-scale Chinese Cross-modal Benchmark
ACM Multimedia (ACM MM), 2022
Chunyu Xie
Heng Cai
Jincheng Li
Fanjing Kong
Xiaoyu Wu
...
Xiangzheng Zhang
Dawei Leng
Baochang Zhang
Xiangyang Ji
Yafeng Deng
MLLM
VLM
273
21
0
08 May 2022
Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction
Xiang Chen
Ningyu Zhang
Lei Li
Yunzhi Yao
Shumin Deng
Chuanqi Tan
Fei Huang
Luo Si
Huajun Chen
130
46
0
07 May 2022
Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2022
Xiang Chen
Ningyu Zhang
Lei Li
Shumin Deng
Chuanqi Tan
Changliang Xu
Fei Huang
Luo Si
Huajun Chen
222
196
0
04 May 2022
PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining
Neural Information Processing Systems (NeurIPS), 2022
Yuting Gao
Jinfeng Liu
Zihan Xu
Jinchao Zhang
Ke Li
Rongrong Ji
Chunhua Shen
VLM
CLIP
403
141
0
29 Apr 2022
CapOnImage: Context-driven Dense-Captioning on Image
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yiqi Gao
Xinglin Hou
Yuanmeng Zhang
Bo Xiao
Yuning Jiang
Peifeng Wang
189
13
0
27 Apr 2022
Contrastive Language-Action Pre-training for Temporal Localization
Mengmeng Xu
Erhan Gundogdu
Maksim
Guohao Li
M. Donoser
Loris Bazzani
189
25
0
26 Apr 2022
Progressive Learning for Image Retrieval with Hybrid-Modality Queries
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2022
Yida Zhao
Yuqing Song
Qin Jin
188
40
0
24 Apr 2022
Training and challenging models for text-guided fashion image retrieval
Eric Dodds
Jack Culpepper
Gaurav Srivastava
145
10
0
23 Apr 2022
Unified Pretraining Framework for Document Understanding
Neural Information Processing Systems (NeurIPS), 2022
Jiuxiang Gu
Jason Kuen
Vlad I. Morariu
Handong Zhao
Nikolaos Barmpalios
R. Jain
A. Nenkova
Tong Sun
272
111
0
22 Apr 2022
A Multi-level Alignment Training Scheme for Video-and-Language Grounding
Yubo Zhang
Feiyang Niu
Q. Ping
Govind Thattai
CVBM
216
2
0
22 Apr 2022
Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing
European Conference on Computer Vision (ECCV), 2022
Benedikt Boecking
Naoto Usuyama
Shruthi Bannur
Daniel Coelho De Castro
Anton Schwaighofer
...
Tristan Naumann
A. Nori
Javier Alvarez-Valle
Hoifung Poon
Ozan Oktay
486
358
0
21 Apr 2022
Imagination-Augmented Natural Language Understanding
North American Chapter of the Association for Computational Linguistics (NAACL), 2022
Yujie Lu
Wanrong Zhu
Xinze Wang
Miguel P. Eckstein
William Yang Wang
216
25
0
18 Apr 2022
End-to-end Dense Video Captioning as Sequence Generation
International Conference on Computational Linguistics (COLING), 2022
Wanrong Zhu
Bo Pang
Ashish V. Thapliyal
William Yang Wang
Radu Soricut
DiffM
216
45
0
18 Apr 2022
Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks
IEEE Transactions on Image Processing (IEEE TIP), 2022
Gen Luo
Weihao Ye
Xiaoshuai Sun
Yan Wang
Liujuan Cao
Yongjian Wu
Feiyue Huang
Rongrong Ji
ViT
153
57
0
16 Apr 2022
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Computer Vision and Pattern Recognition (CVPR), 2022
Haoyu Lu
Nanyi Fei
Yuqi Huo
Yizhao Gao
Zhiwu Lu
Jiaxin Wen
CLIP
VLM
254
55
0
15 Apr 2022
Vision-and-Language Pretrained Models: A Survey
International Joint Conference on Artificial Intelligence (IJCAI), 2022
Siqu Long
Feiqi Cao
S. Han
Haiqing Yang
VLM
422
71
0
15 Apr 2022
Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog
Shunyu Zhang
X. Jiang
Zequn Yang
T. Wan
Zengchang Qin
164
14
0
10 Apr 2022
Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data
AAAI Conference on Artificial Intelligence (AAAI), 2022
Yunxing Kang
Tianqiao Liu
Hang Li
Y. Hao
Wenbiao Ding
164
9
0
10 Apr 2022
Temporal Alignment Networks for Long-term Video
Computer Vision and Pattern Recognition (CVPR), 2022
Tengda Han
Weidi Xie
Andrew Zisserman
AI4TS
169
104
0
06 Apr 2022
SimVQA: Exploring Simulated Environments for Visual Question Answering
Computer Vision and Pattern Recognition (CVPR), 2022
Paola Cascante-Bonilla
Hui Wu
Letao Wang
Rogerio Feris
Vicente Ordonez
209
9
0
31 Mar 2022
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
Computer Vision and Pattern Recognition (CVPR), 2022
Mengjun Cheng
Yipeng Sun
Long Wang
Xiongwei Zhu
Kun Yao
...
Guoli Song
Junyu Han
Jingtuo Liu
Errui Ding
Jingdong Wang
277
72
0
31 Mar 2022
TubeDETR: Spatio-Temporal Video Grounding with Transformers
Computer Vision and Pattern Recognition (CVPR), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
341
121
0
30 Mar 2022
Image-text Retrieval: A Survey on Recent Research and Development
International Joint Conference on Artificial Intelligence (IJCAI), 2022
Min Cao
Shiping Li
Juntao Li
Liqiang Nie
Min Zhang
336
108
0
28 Mar 2022
Large-scale Bilingual Language-Image Contrastive Learning
ByungSoo Ko
Geonmo Gu
VLM
257
17
0
28 Mar 2022
Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably)
International Conference on Machine Learning (ICML), 2022
Yu Huang
Junyang Lin
Chang Zhou
Hongxia Yang
Longbo Huang
171
144
0
23 Mar 2022
Local-Global Context Aware Transformer for Language-Guided Video Segmentation
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Chen Liang
Wenguan Wang
Tianfei Zhou
Jiaxu Miao
Yawei Luo
Yi Yang
VOS
322
100
0
18 Mar 2022
Deep Unsupervised Hashing with Latent Semantic Components
AAAI Conference on Artificial Intelligence (AAAI), 2022
Qinghong Lin
Xiaojun Chen
Qin Zhang
Shao-Qian Cai
Wenzhe Zhao
Hongfa Wang
238
3
0
17 Mar 2022
UNIMO-2: End-to-End Unified Vision-Language Grounded Learning
Findings (Findings), 2022
Wei Li
Can Gao
Guocheng Niu
Xinyan Xiao
Hao Liu
Jiachen Liu
Hua Wu
Haifeng Wang
MLLM
145
24
0
17 Mar 2022
The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy
Computer Vision and Pattern Recognition (CVPR), 2022
Tianlong Chen
Zhenyu Zhang
Yu Cheng
Ahmed Hassan Awadallah
Zinan Lin
ViT
256
49
0
12 Mar 2022
LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval
Jie Lei
Xinlei Chen
Ning Zhang
Meng-xing Wang
Joey Tianyi Zhou
Tamara L. Berg
Licheng Yu
229
15
0
10 Mar 2022
Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Xiwen Liang
Fengda Zhu
Lingling Li
Hang Xu
Xiaodan Liang
LM&Ro
VLM
119
33
0
08 Mar 2022
Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting
European Conference on Computer Vision (ECCV), 2022
Chuhui Xue
Wenqing Zhang
Yu Hao
Shijian Lu
Juil Sock
Song Bai
VLM
265
46
0
08 Mar 2022
Where Does the Performance Improvement Come From? -- A Reproducibility Concern about Image-Text Retrieval
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2022
Jun Rao
Haiwei Yang
Liang Ding
Shuhan Qi
Yibing Zhan
Weifeng Liu
Dacheng Tao
OOD
236
34
0
08 Mar 2022
Find a Way Forward: a Language-Guided Semantic Map Navigator
Zehao Wang
Mingxiao Li
Minye Wu
Marie-Francine Moens
Tinne Tuytelaars
LM&Ro
144
4
0
07 Mar 2022
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
Feng Li
Hao Zhang
Yi-Fan Zhang
Shixuan Liu
Jian Guo
L. Ni
Pengchuan Zhang
Lei Zhang
AI4TS
VLM
204
41
0
03 Mar 2022
Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment
Computer Vision and Pattern Recognition (CVPR), 2022
Mingyang Zhou
Licheng Yu
Amanpreet Singh
Mengjiao MJ Wang
Zhou Yu
Ning Zhang
VLM
158
35
0
01 Mar 2022
Multi-modal Alignment using Representation Codebook
Computer Vision and Pattern Recognition (CVPR), 2022
Jiali Duan
Liqun Chen
Son Tran
Jinyu Yang
Yi Xu
Belinda Zeng
Trishul Chilimbi
486
78
0
28 Feb 2022
COMPASS: Contrastive Multimodal Pretraining for Autonomous Systems
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022
Shuang Ma
Sai H. Vemprala
Wenshan Wang
Jayesh K. Gupta
Yale Song
Daniel J. McDuff
Ashish Kapoor
SSL
188
12
0
20 Feb 2022
A Survey of Vision-Language Pre-Trained Models
International Joint Conference on Artificial Intelligence (IJCAI), 2022
Yifan Du
Zikang Liu
Junyi Li
Wayne Xin Zhao
VLM
396
241
0
18 Feb 2022
AMS_ADRN at SemEval-2022 Task 5: A Suitable Image-text Multimodal Joint Modeling Method for Multi-task Misogyny Identification
International Workshop on Semantic Evaluation (SemEval), 2022
Da Li
Ming Yi
Yukai He
141
2
0
18 Feb 2022