1908.06066
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2019
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL
VLM
MLLM
Papers citing
"Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"
50 / 518 papers shown
VLP: A Survey on Vision-Language Pre-training
Machine Intelligence Research (MIR), 2022
Feilong Chen
Duzhen Zhang
Minglun Han
Xiuyi Chen
Jing Shi
Shuang Xu
Bo Xu
VLM
393
287
0
18 Feb 2022
CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval
Knowledge Discovery and Data Mining (KDD), 2022
Licheng Yu
Jun Chen
Animesh Sinha
Mengjiao MJ Wang
Hugo Chen
Tamara L. Berg
Ning Zhang
VLM
257
44
0
15 Feb 2022
Multi-Modal Knowledge Graph Construction and Application: A Survey
IEEE Transactions on Knowledge and Data Engineering (TKDE), 2022
Xiangru Zhu
Zhixu Li
Xiaodan Wang
Xueyao Jiang
Yixiang Chen
Xuwu Wang
Yanghua Xiao
N. Yuan
207
233
0
11 Feb 2022
Image Difference Captioning with Pre-training and Contrastive Learning
AAAI Conference on Artificial Intelligence (AAAI), 2022
Linli Yao
Weiying Wang
Qin Jin
SSL
VLM
239
55
0
09 Feb 2022
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
International Conference on Machine Learning (ICML), 2022
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLM
ObjD
517
1,009
0
07 Feb 2022
A Frustratingly Simple Approach for End-to-End Image Captioning
Ziyang Luo
Yadong Xi
Rongsheng Zhang
Jing Ma
VLM
MLLM
237
19
0
30 Jan 2022
MVPTR: Multi-Level Semantic Alignment for Vision-Language Pre-Training via Multi-Stage Learning
ACM Multimedia (ACM MM), 2022
Zejun Li
Zhihao Fan
Huaixiao Tou
Jingjing Chen
Zhongyu Wei
Xuanjing Huang
235
23
0
29 Jan 2022
MGA-VQA: Multi-Granularity Alignment for Visual Question Answering
Peixi Xiong
Yilin Shen
Hongxia Jin
108
8
0
25 Jan 2022
Do Smart Glasses Dream of Sentimental Visions? Deep Emotionship Analysis for Eyewear Devices
Proceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies (IMWUT), 2022
Yingying Zhao
Yuhu Chang
Yutian Lu
Yujiang Wang
Mingzhi Dong
...
Robert P. Dick
Fan Yang
Tun Lu
Ning Gu
L. Shang
183
15
0
24 Jan 2022
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
Yehao Li
Jiahao Fan
Yingwei Pan
Ting Yao
Weiyao Lin
Tao Mei
MLLM
ObjD
220
24
0
11 Jan 2022
On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering
Ankur Sikarwar
Gabriel Kreiman
ViT
95
2
0
11 Jan 2022
Language-driven Semantic Segmentation
International Conference on Learning Representations (ICLR), 2022
Boyi Li
Kilian Q. Weinberger
Serge Belongie
V. Koltun
René Ranftl
VLM
329
780
0
10 Jan 2022
Self-Training Vision Language BERTs with a Unified Conditional Model
Xiaofeng Yang
Fengmao Lv
Fayao Liu
Guosheng Lin
SSL
VLM
306
18
0
06 Jan 2022
Discrete and continuous representations and processing in deep learning: Looking forward
AI Open (AO), 2022
Ruben Cartuyvels
Graham Spinks
Marie-Francine Moens
OCL
300
28
0
04 Jan 2022
A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model
European Conference on Computer Vision (ECCV), 2021
Mengde Xu
Zheng Zhang
Fangyun Wei
Yutong Lin
Yue Cao
Han Hu
Xiang Bai
VLM
381
287
0
29 Dec 2021
LaTr: Layout-Aware Transformer for Scene-Text VQA
Computer Vision and Pattern Recognition (CVPR), 2021
Ali Furkan Biten
Ron Litman
Yusheng Xie
Srikar Appalaraju
R. Manmatha
ViT
378
116
0
23 Dec 2021
KAT: A Knowledge Augmented Transformer for Vision-and-Language
Liangke Gui
Borui Wang
Qiuyuan Huang
Alexander G. Hauptmann
Yonatan Bisk
Jianfeng Gao
240
196
0
16 Dec 2021
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
Letitia Parcalabescu
Michele Cafagna
Lilitta Muradjan
Anette Frank
Iacer Calixto
Albert Gatt
CoGe
301
135
0
14 Dec 2021
CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising
Jianjie Luo
Yehao Li
Yingwei Pan
Ting Yao
Hongyang Chao
Tao Mei
VLM
155
45
0
14 Dec 2021
ACE-BERT: Adversarial Cross-modal Enhanced BERT for E-commerce Retrieval
Boxuan Zhang
Chao Wei
Yang Jin
Weiru Zhang
101
3
0
14 Dec 2021
MLP Architectures for Vision-and-Language Modeling: An Empirical Study
Yi-Liang Nie
Linjie Li
Zhe Gan
Shuohang Wang
Chenguang Zhu
Michael Zeng
Zicheng Liu
Joey Tianyi Zhou
Lijuan Wang
165
8
0
08 Dec 2021
Grounded Language-Image Pre-training
Liunian Harold Li
Pengchuan Zhang
Haotian Zhang
Jianwei Yang
Chunyuan Li
...
Lu Yuan
Lei Zhang
Lei Li
Kai-Wei Chang
Jianfeng Gao
ObjD
VLM
458
1,385
0
07 Dec 2021
CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification
Huidong Liu
Shaoyuan Xu
Jinmiao Fu
Yang Liu
Ning Xie
Chien Wang
Bryan Wang
Yi Sun
CLIP
VLM
206
30
0
07 Dec 2021
Semantic Segmentation In-the-Wild Without Seeing Any Segmentation Examples
Nir Zabari
Yedid Hoshen
VLM
214
29
0
06 Dec 2021
General Facial Representation Learning in a Visual-Linguistic Manner
Computer Vision and Pattern Recognition (CVPR), 2021
Yinglin Zheng
Hao Yang
Ting Zhang
Jianmin Bao
Dongdong Chen
Yangyu Huang
Lu Yuan
Dong Chen
Ming Zeng
Fang Wen
CVBM
461
230
0
06 Dec 2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Jiaming Song
Xiaohua Wang
Jifeng Dai
250
152
0
02 Dec 2021
AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant
Stan Weixian Lei
Difei Gao
Yuxuan Wang
Dongxing Mao
Zihan Liang
L. Ran
Mike Zheng Shou
299
8
0
30 Nov 2021
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
Valerii Likhosherstov
Anurag Arnab
K. Choromanski
Mario Lucic
Yi Tay
Adrian Weller
Mostafa Dehghani
ViT
192
83
0
25 Nov 2021
Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets
Marcella Cornia
Lorenzo Baraldi
G. Fiameni
Rita Cucchiara
320
14
0
24 Nov 2021
Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling
Dat T. Huynh
Jason Kuen
Zhe Lin
Jiuxiang Gu
Ehsan Elhamifar
ISeg
VLM
284
100
0
24 Nov 2021
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
Tsu-Jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Wenjie Wang
Lijuan Wang
Zicheng Liu
VLM
402
239
0
24 Nov 2021
Scaling Up Vision-Language Pre-training for Image Captioning
Xiaowei Hu
Zhe Gan
Jianfeng Wang
Zhengyuan Yang
Zicheng Liu
Yumao Lu
Lijuan Wang
MLLM
VLM
420
297
0
24 Nov 2021
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Faisal Ahmed
Zicheng Liu
Yumao Lu
Lijuan Wang
348
134
0
23 Nov 2021
RedCaps: web-curated image-text data created by the people, for the people
Karan Desai
Gaurav Kaul
Zubin Aysola
Justin Johnson
283
191
0
22 Nov 2021
DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
Xu Yan
Zhengcong Fei
Shuhui Wang
Qingming Huang
Qi Tian
VGen
248
4
0
19 Nov 2021
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
Jianfeng Wang
Xiaowei Hu
Zhe Gan
Zhengyuan Yang
Xiyang Dai
Zicheng Liu
Yumao Lu
Lijuan Wang
ViT
176
62
0
19 Nov 2021
Achieving Human Parity on Visual Question Answering
Ming Yan
Haiyang Xu
Chenliang Li
Junfeng Tian
Bin Bi
...
Ji Zhang
Songfang Huang
Fei Huang
Luo Si
Rong Jin
146
19
0
17 Nov 2021
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
International Conference on Machine Learning (ICML), 2021
Yan Zeng
Xinsong Zhang
Hang Li
VLM
CLIP
331
352
0
16 Nov 2021
Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation
European Conference on Computer Vision (ECCV), 2021
Chuang Lin
Yi Jiang
Jianfei Cai
Zhuang Li
Gholamreza Haffari
Zehuan Yuan
180
37
0
10 Nov 2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
International Conference on Learning Representations (ICLR), 2021
Lewei Yao
Runhu Huang
Lu Hou
Guansong Lu
Minzhe Niu
Hang Xu
Xiaodan Liang
Zhenguo Li
Xin Jiang
Chunjing Xu
VLM
CLIP
336
761
0
09 Nov 2021
A Survey on Green Deep Learning
Jingjing Xu
Wangchunshu Zhou
Zhiyi Fu
Hao Zhou
Lei Li
VLM
457
93
0
08 Nov 2021
Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval
Zhihao Fan
Zhongyu Wei
Zejun Li
Siyuan Wang
Jianqing Fan
179
7
0
05 Nov 2021
Towards artificial general intelligence via a multimodal foundation model
Nanyi Fei
Zhiwu Lu
Yizhao Gao
Guoxing Yang
Yuqi Huo
...
Ruihua Song
Xin Gao
Tao Xiang
Haoran Sun
Jiling Wen
AI4CE
LRM
225
284
0
27 Oct 2021
TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation
Tanzila Rahman
Mengyu Yang
Leonid Sigal
ViT
142
8
0
26 Oct 2021
VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval
Knowledge-Based Systems (KBS), 2021
Lisai Zhang
Hongfa Wu
Qingcai Chen
Yimeng Deng
Zhonghua Li
Dejiang Kong
Bo Zhao
Joanna Siebert
Yunpeng Han
ViT
VLM
206
24
0
20 Oct 2021
TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation
Haoyu Ma
Liangjian Chen
Deying Kong
Zhe Wang
Xingwei Liu
Hao Tang
Xiangyi Yan
Yusheng Xie
Shi-yao Lin
Xiaohui Xie
ViT
335
72
0
18 Oct 2021
SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition
IEEE International Conference on Computer Vision (ICCV), 2021
Hezhen Hu
Weichao Zhao
Wen-gang Zhou
Yuechen Wang
Houqiang Li
ViT
263
105
0
11 Oct 2021
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Dmytro Okhonko
Armen Aghajanyan
Florian Metze
Luke Zettlemoyer
Christoph Feichtenhofer
CLIP
VLM
806
690
0
28 Sep 2021
Visually Grounded Reasoning across Languages and Cultures
Fangyu Liu
Emanuele Bugliarello
Edoardo Ponti
Siva Reddy
Nigel Collier
Desmond Elliott
VLM
LRM
473
201
0
28 Sep 2021
KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation
Yongfei Liu
Chenfei Wu
Shao-Yen Tseng
Vasudev Lal
Xuming He
Nan Duan
CLIP
VLM
281
32
0
22 Sep 2021