ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1908.06066
  4. Cited By
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal
  Pre-training
v1v2v3 (latest)

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

AAAI Conference on Artificial Intelligence (AAAI), 2019
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
    SSLVLMMLLM
ArXiv (abs)PDFHTML

Papers citing "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"

50 / 518 papers shown
VLP: A Survey on Vision-Language Pre-training
VLP: A Survey on Vision-Language Pre-trainingMachine Intelligence Research (MIR), 2022
Feilong Chen
Duzhen Zhang
Minglun Han
Xiuyi Chen
Jing Shi
Shuang Xu
Bo Xu
VLM
393
287
0
18 Feb 2022
CommerceMM: Large-Scale Commerce MultiModal Representation Learning with
  Omni Retrieval
CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni RetrievalKnowledge Discovery and Data Mining (KDD), 2022
Licheng Yu
Jun Chen
Animesh Sinha
Mengjiao MJ Wang
Hugo Chen
Tamara L. Berg
Ning Zhang
VLM
257
44
0
15 Feb 2022
Multi-Modal Knowledge Graph Construction and Application: A Survey
Multi-Modal Knowledge Graph Construction and Application: A SurveyIEEE Transactions on Knowledge and Data Engineering (TKDE), 2022
Xiangru Zhu
Zhixu Li
Xiaodan Wang
Xueyao Jiang
Yixiang Chen
Xuwu Wang
Yanghua Xiao
N. Yuan
207
233
0
11 Feb 2022
Image Difference Captioning with Pre-training and Contrastive Learning
Image Difference Captioning with Pre-training and Contrastive LearningAAAI Conference on Artificial Intelligence (AAAI), 2022
Linli Yao
Weiying Wang
Qin Jin
SSLVLM
239
55
0
09 Feb 2022
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple
  Sequence-to-Sequence Learning Framework
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning FrameworkInternational Conference on Machine Learning (ICML), 2022
Peng Wang
An Yang
Rui Men
Junyang Lin
Shuai Bai
Zhikang Li
Jianxin Ma
Chang Zhou
Jingren Zhou
Hongxia Yang
MLLMObjD
517
1,009
0
07 Feb 2022
A Frustratingly Simple Approach for End-to-End Image Captioning
A Frustratingly Simple Approach for End-to-End Image Captioning
Ziyang Luo
Yadong Xi
Rongsheng Zhang
Jing Ma
VLMMLLM
237
19
0
30 Jan 2022
MVPTR: Multi-Level Semantic Alignment for Vision-Language Pre-Training
  via Multi-Stage Learning
MVPTR: Multi-Level Semantic Alignment for Vision-Language Pre-Training via Multi-Stage LearningACM Multimedia (ACM MM), 2022
Zejun Li
Zhihao Fan
Huaixiao Tou
Jingjing Chen
Zhongyu Wei
Xuanjing Huang
235
23
0
29 Jan 2022
MGA-VQA: Multi-Granularity Alignment for Visual Question Answering
MGA-VQA: Multi-Granularity Alignment for Visual Question Answering
Peixi Xiong
Yilin Shen
Hongxia Jin
108
8
0
25 Jan 2022
Do Smart Glasses Dream of Sentimental Visions? Deep Emotionship Analysis
  for Eyewear Devices
Do Smart Glasses Dream of Sentimental Visions? Deep Emotionship Analysis for Eyewear DevicesProceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies (IMWUT), 2022
Yingying Zhao
Yuhu Chang
Yutian Lu
Yujiang Wang
Mingzhi Dong
...
Robert P. Dick
Fan Yang
Tun Lu
Ning Gu
L. Shang
183
15
0
24 Jan 2022
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular
  Vision-Language Pre-training
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training
Yehao Li
Jiahao Fan
Yingwei Pan
Ting Yao
Weiyao Lin
Tao Mei
MLLMObjD
220
24
0
11 Jan 2022
On the Efficacy of Co-Attention Transformer Layers in Visual Question
  Answering
On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering
Ankur Sikarwar
Gabriel Kreiman
ViT
95
2
0
11 Jan 2022
Language-driven Semantic Segmentation
Language-driven Semantic SegmentationInternational Conference on Learning Representations (ICLR), 2022
Boyi Li
Kilian Q. Weinberger
Serge Belongie
V. Koltun
René Ranftl
VLM
329
780
0
10 Jan 2022
Self-Training Vision Language BERTs with a Unified Conditional Model
Self-Training Vision Language BERTs with a Unified Conditional Model
Xiaofeng Yang
Fengmao Lv
Fayao Liu
Guosheng Lin
SSLVLM
306
18
0
06 Jan 2022
Discrete and continuous representations and processing in deep learning:
  Looking forward
Discrete and continuous representations and processing in deep learning: Looking forwardAI Open (AO), 2022
Ruben Cartuyvels
Graham Spinks
Marie-Francine Moens
OCL
300
28
0
04 Jan 2022
A Simple Baseline for Open-Vocabulary Semantic Segmentation with
  Pre-trained Vision-language Model
A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language ModelEuropean Conference on Computer Vision (ECCV), 2021
Mengde Xu
Zheng Zhang
Fangyun Wei
Yutong Lin
Yue Cao
Han Hu
Xiang Bai
VLM
381
287
0
29 Dec 2021
LaTr: Layout-Aware Transformer for Scene-Text VQA
LaTr: Layout-Aware Transformer for Scene-Text VQAComputer Vision and Pattern Recognition (CVPR), 2021
Ali Furkan Biten
Ron Litman
Yusheng Xie
Srikar Appalaraju
R. Manmatha
ViT
378
116
0
23 Dec 2021
KAT: A Knowledge Augmented Transformer for Vision-and-Language
KAT: A Knowledge Augmented Transformer for Vision-and-Language
Liangke Gui
Borui Wang
Qiuyuan Huang
Alexander G. Hauptmann
Yonatan Bisk
Jianfeng Gao
240
196
0
16 Dec 2021
VALSE: A Task-Independent Benchmark for Vision and Language Models
  Centered on Linguistic Phenomena
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
Letitia Parcalabescu
Michele Cafagna
Lilitta Muradjan
Anette Frank
Iacer Calixto
Albert Gatt
CoGe
301
135
0
14 Dec 2021
CoCo-BERT: Improving Video-Language Pre-training with Contrastive
  Cross-modal Matching and Denoising
CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising
Jianjie Luo
Yehao Li
Yingwei Pan
Ting Yao
Hongyang Chao
Tao Mei
VLM
155
45
0
14 Dec 2021
ACE-BERT: Adversarial Cross-modal Enhanced BERT for E-commerce Retrieval
ACE-BERT: Adversarial Cross-modal Enhanced BERT for E-commerce Retrieval
Boxuan Zhang
Chao Wei
Yang Jin
Weiru Zhang
101
3
0
14 Dec 2021
MLP Architectures for Vision-and-Language Modeling: An Empirical Study
MLP Architectures for Vision-and-Language Modeling: An Empirical Study
Yi-Liang Nie
Linjie Li
Zhe Gan
Shuohang Wang
Chenguang Zhu
Michael Zeng
Zicheng Liu
Joey Tianyi Zhou
Lijuan Wang
165
8
0
08 Dec 2021
Grounded Language-Image Pre-training
Grounded Language-Image Pre-training
Liunian Harold Li
Pengchuan Zhang
Haotian Zhang
Jianwei Yang
Chunyuan Li
...
Lu Yuan
Lei Zhang
Lei Li
Kai-Wei Chang
Jianfeng Gao
ObjDVLM
458
1,385
0
07 Dec 2021
CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification
CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification
Huidong Liu
Shaoyuan Xu
Jinmiao Fu
Yang Liu
Ning Xie
Chien Wang
Bryan Wang
Yi Sun
CLIPVLM
206
30
0
07 Dec 2021
Semantic Segmentation In-the-Wild Without Seeing Any Segmentation
  Examples
Semantic Segmentation In-the-Wild Without Seeing Any Segmentation Examples
Nir Zabari
Yedid Hoshen
VLM
214
29
0
06 Dec 2021
General Facial Representation Learning in a Visual-Linguistic Manner
General Facial Representation Learning in a Visual-Linguistic MannerComputer Vision and Pattern Recognition (CVPR), 2021
Yinglin Zheng
Hao Yang
Ting Zhang
Jianmin Bao
Dongdong Chen
Yangyu Huang
Lu Yuan
Dong Chen
Ming Zeng
Fang Wen
CVBM
461
230
0
06 Dec 2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception
  for Zero-shot and Few-shot Tasks
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Jiaming Song
Xiaohua Wang
Jifeng Dai
250
152
0
02 Dec 2021
AssistSR: Task-oriented Video Segment Retrieval for Personal AI
  Assistant
AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant
Stan Weixian Lei
Difei Gao
Yuxuan Wang
Dongxing Mao
Zihan Liang
L. Ran
Mike Zheng Shou
299
8
0
30 Nov 2021
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
Valerii Likhosherstov
Anurag Arnab
K. Choromanski
Mario Lucic
Yi Tay
Adrian Weller
Mostafa Dehghani
ViT
192
83
0
25 Nov 2021
Generating More Pertinent Captions by Leveraging Semantics and Style on
  Multi-Source Datasets
Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets
Marcella Cornia
Lorenzo Baraldi
G. Fiameni
Rita Cucchiara
320
14
0
24 Nov 2021
Open-Vocabulary Instance Segmentation via Robust Cross-Modal
  Pseudo-Labeling
Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling
Dat T. Huynh
Jason Kuen
Zhe Lin
Jiuxiang Gu
Ehsan Elhamifar
ISegVLM
284
100
0
24 Nov 2021
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token
  Modeling
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
Tsu-Jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
Wenjie Wang
Lijuan Wang
Zicheng Liu
VLM
402
239
0
24 Nov 2021
Scaling Up Vision-Language Pre-training for Image Captioning
Scaling Up Vision-Language Pre-training for Image Captioning
Xiaowei Hu
Zhe Gan
Jianfeng Wang
Zhengyuan Yang
Zicheng Liu
Yumao Lu
Lijuan Wang
MLLMVLM
420
297
0
24 Nov 2021
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language
  Modeling
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Faisal Ahmed
Zicheng Liu
Yumao Lu
Lijuan Wang
348
134
0
23 Nov 2021
RedCaps: web-curated image-text data created by the people, for the
  people
RedCaps: web-curated image-text data created by the people, for the people
Karan Desai
Gaurav Kaul
Zubin Aysola
Justin Johnson
283
191
0
22 Nov 2021
DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
Xu Yan
Zhengcong Fei
Shuhui Wang
Qingming Huang
Qi Tian
VGen
248
4
0
19 Nov 2021
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning
Jianfeng Wang
Xiaowei Hu
Zhe Gan
Zhengyuan Yang
Xiyang Dai
Zicheng Liu
Yumao Lu
Lijuan Wang
ViT
176
62
0
19 Nov 2021
Achieving Human Parity on Visual Question Answering
Achieving Human Parity on Visual Question Answering
Ming Yan
Haiyang Xu
Chenliang Li
Junfeng Tian
Bin Bi
...
Ji Zhang
Songfang Huang
Fei Huang
Luo Si
Rong Jin
146
19
0
17 Nov 2021
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual
  Concepts
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual ConceptsInternational Conference on Machine Learning (ICML), 2021
Yan Zeng
Xinsong Zhang
Hang Li
VLMCLIP
331
352
0
16 Nov 2021
Multimodal Transformer with Variable-length Memory for
  Vision-and-Language Navigation
Multimodal Transformer with Variable-length Memory for Vision-and-Language NavigationEuropean Conference on Computer Vision (ECCV), 2021
Chuang Lin
Yi Jiang
Jianfei Cai
Zhuang Li
Gholamreza Haffari
Zehuan Yuan
180
37
0
10 Nov 2021
FILIP: Fine-grained Interactive Language-Image Pre-Training
FILIP: Fine-grained Interactive Language-Image Pre-TrainingInternational Conference on Learning Representations (ICLR), 2021
Lewei Yao
Runhu Huang
Lu Hou
Guansong Lu
Minzhe Niu
Hang Xu
Xiaodan Liang
Zhenguo Li
Xin Jiang
Chunjing Xu
VLMCLIP
336
761
0
09 Nov 2021
A Survey on Green Deep Learning
A Survey on Green Deep Learning
Jingjing Xu
Wangchunshu Zhou
Zhiyi Fu
Hao Zhou
Lei Li
VLM
457
93
0
08 Nov 2021
Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences
  for Image-Text Retrieval
Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval
Zhihao Fan
Zhongyu Wei
Zejun Li
Siyuan Wang
Jianqing Fan
179
7
0
05 Nov 2021
Towards artificial general intelligence via a multimodal foundation
  model
Towards artificial general intelligence via a multimodal foundation model
Nanyi Fei
Zhiwu Lu
Yizhao Gao
Guoxing Yang
Yuqi Huo
...
Ruihua Song
Xin Gao
Tao Xiang
Haoran Sun
Jiling Wen
AI4CELRM
225
284
0
27 Oct 2021
TriBERT: Full-body Human-centric Audio-visual Representation Learning
  for Visual Sound Separation
TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation
Tanzila Rahman
Mengyu Yang
Leonid Sigal
ViT
142
8
0
26 Oct 2021
VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal
  Retrieval
VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal RetrievalKnowledge-Based Systems (KBS), 2021
Lisai Zhang
Hongfa Wu
Qingcai Chen
Yimeng Deng
Zhonghua Li
Dejiang Kong
Bo Zhao
Joanna Siebert
Yunpeng Han
ViTVLM
206
24
0
20 Oct 2021
TransFusion: Cross-view Fusion with Transformer for 3D Human Pose
  Estimation
TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation
Haoyu Ma
Liangjian Chen
Deying Kong
Zhe Wang
Xingwei Liu
Hao Tang
Xiangyi Yan
Yusheng Xie
Shi-yao Lin
Xiaohui Xie
ViT
335
72
0
18 Oct 2021
SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign
  Language Recognition
SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language RecognitionIEEE International Conference on Computer Vision (ICCV), 2021
Hezhen Hu
Weichao Zhao
Wen-gang Zhou
Yuechen Wang
Houqiang Li
ViT
263
105
0
11 Oct 2021
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text
  Understanding
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Dmytro Okhonko
Armen Aghajanyan
Florian Metze
Luke Zettlemoyer
Florian Metze Luke Zettlemoyer Christoph Feichtenhofer
CLIPVLM
806
690
0
28 Sep 2021
Visually Grounded Reasoning across Languages and Cultures
Visually Grounded Reasoning across Languages and Cultures
Fangyu Liu
Emanuele Bugliarello
Edoardo Ponti
Siva Reddy
Nigel Collier
Desmond Elliott
VLMLRM
473
201
0
28 Sep 2021
KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object
  Knowledge Distillation
KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation
Yongfei Liu
Chenfei Wu
Shao-Yen Tseng
Vasudev Lal
Xuming He
Nan Duan
CLIPVLM
281
32
0
22 Sep 2021
Previous
123...10116789
Next