Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

AAAI Conference on Artificial Intelligence (AAAI), 2019
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL VLM MLLM

Papers citing "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"

50 / 518 papers shown
What Vision-Language Models 'See' when they See Scenes
Michele Cafagna
Kees van Deemter
Albert Gatt
VLM
256
13
0
15 Sep 2021
xGQA: Cross-Lingual Visual Question Answering
Jonas Pfeiffer
Gregor Geigle
Aishwarya Kamath
Jan-Martin O. Steitz
Stefan Roth
Ivan Vulić
Iryna Gurevych
357
78
0
13 Sep 2021
Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval
Zhihao Fan
Zhongyu Wei
Zejun Li
Siyuan Wang
Haijun Shan
Xuanjing Huang
Jianqing Fan
CLIP
96
12
0
12 Sep 2021
Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Tiezheng Yu
Wenliang Dai
Zihan Liu
Pascale Fung
293
79
0
06 Sep 2021
Improving Joint Learning of Chest X-Ray and Radiology Report by Word Region Alignment
Zhanghexuan Ji
Mohammad Abuzar Shaikh
Dana Moukheiber
S. Srihari
Yifan Peng
Mingchen Gao
SSL
182
23
0
04 Sep 2021
LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
Mohammad Abuzar Shaikh
Zhanghexuan Ji
Dana Moukheiber
Yan Shen
S. Srihari
Mingchen Gao
VLM
152
1
0
04 Sep 2021
Multimodal Conditionality for Natural Language Generation
Michael Sollami
Aashish Jain
115
10
0
02 Sep 2021
CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Hang Li
Yunxing Kang
Tianqiao Liu
Wenbiao Ding
Zitao Liu
166
20
0
01 Sep 2021
Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training
ACM Multimedia (ACM MM), 2021
Yuqing Song
Shizhe Chen
Qin Jin
Wei Luo
Jun Xie
Fei Huang
198
25
0
25 Aug 2021
Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training
Ming Yan
Haiyang Xu
Chenliang Li
Bin Bi
Junfeng Tian
Min Gui
Wei Wang
VLM
122
11
0
21 Aug 2021
Knowledge Perceived Multi-modal Pretraining in E-commerce
Yushan Zhu
Huaixiao Tou
Wen Zhang
Ganqiang Ye
Hui Chen
Ningyu Zhang
Huajun Chen
229
37
0
20 Aug 2021
Indoor Semantic Scene Understanding using Multi-modality Fusion
Muraleekrishna Gopinathan
Giang Truong
Jumana Abu-Khalaf
157
0
0
17 Aug 2021
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Yuhao Cui
Zhou Yu
Chunqi Wang
Zhongzhou Zhao
Ji Zhang
Meng Wang
Jun-chen Yu
VLM
166
58
0
16 Aug 2021
StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators
Rinon Gal
Or Patashnik
Haggai Maron
Gal Chechik
Daniel Cohen-Or
CLIP VLM
267
275
0
02 Aug 2021
BadEncoder: Backdoor Attacks to Pre-trained Encoders in Self-Supervised Learning
Jinyuan Jia
Yupei Liu
Neil Zhenqiang Gong
SILM SSL
305
184
0
01 Aug 2021
UIBert: Learning Generic Multimodal Representations for UI Understanding
International Joint Conference on Artificial Intelligence (IJCAI), 2021
Chongyang Bai
Xiaoxue Zang
Ying Xu
Srinivas Sunkara
Abhinav Rastogi
Jindong Chen
Blaise Agüera y Arcas
258
111
0
29 Jul 2021
Exceeding the Limits of Visual-Linguistic Multi-Task Learning
Cameron R. Wolfe
Keld T. Lundgaard
VLM
144
3
0
27 Jul 2021
DRDF: Determining the Importance of Different Multimodal Information with Dual-Router Dynamic Framework
ACM Multimedia (ACM MM), 2021
Haiwen Hong
Xuan Jin
Yin Zhang
Yunqing Hu
Jingfeng Zhang
Yuan He
Hui Xue
MoE
104
0
0
21 Jul 2021
Separating Skills and Concepts for Novel Visual Question Answering
Computer Vision and Pattern Recognition (CVPR), 2021
Spencer Whitehead
Hui Wu
Heng Ji
Rogerio Feris
Kate Saenko
CoGe
179
38
0
19 Jul 2021
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Neural Information Processing Systems (NeurIPS), 2021
Junnan Li
Ramprasaath R. Selvaraju
Akhilesh Deepak Gotmare
Shafiq Joty
Caiming Xiong
Guosheng Lin
FaML
826
2,461
0
16 Jul 2021
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer
Zineng Tang
Jaemin Cho
Hao Tan
Joey Tianyi Zhou
VLM
186
33
0
06 Jul 2021
PhotoChat: A Human-Human Dialogue Dataset with Photo Sharing Behavior for Joint Image-Text Modeling
Xiaoxue Zang
Lijuan Liu
Maria Wang
Yang Song
Hao Zhang
Jindong Chen
VLM
239
65
0
06 Jul 2021
Productivity, Portability, Performance: Data-Centric Python
Yiheng Wang
Yao Zhang
Yanzhang Wang
Yan Wan
Jiao Wang
Zhongyuan Wu
Yuhao Yang
Bowen She
402
111
0
01 Jul 2021
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
Jing Liu
Xinxin Zhu
Fei Liu
Longteng Guo
Zijia Zhao
...
Weining Wang
Hanqing Lu
Shiyu Zhou
Jiajun Zhang
Jinqiao Wang
285
41
0
01 Jul 2021
Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training
Hongwei Xue
Yupan Huang
Bei Liu
Houwen Peng
Jianlong Fu
Houqiang Li
Jiebo Luo
403
93
0
25 Jun 2021
A Transformer-based Cross-modal Fusion Model with Adversarial Training for VQA Challenge 2021
Keda Lu
Bo Fang
Kuan-Yu Chen
ViT
92
2
0
24 Jun 2021
Towards Long-Form Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2021
Chaoxia Wu
Philipp Krahenbuhl
VLM ViT
314
193
0
21 Jun 2021
GEM: A General Evaluation Benchmark for Multimodal Tasks
Findings (Findings), 2021
Lin Su
Nan Duan
Edward Cui
Lei Ji
Chenfei Wu
Huaishao Luo
Yongfei Liu
Ming Zhong
Taroon Bharti
Arun Sacheti
VLM
193
22
0
18 Jun 2021
Efficient Self-supervised Vision Transformers for Representation Learning
International Conference on Learning Representations (ICLR), 2021
Chunyuan Li
Jianwei Yang
Pengchuan Zhang
Mei Gao
Bin Xiao
Xiyang Dai
Lu Yuan
Jianfeng Gao
ViT
287
221
0
17 Jun 2021
Probing Image-Language Transformers for Verb Understanding
Lisa Anne Hendricks
Aida Nematzadeh
211
131
0
16 Jun 2021
Pre-Trained Models: Past, Present and Future
AI Open (AO), 2021
Xu Han
Zhengyan Zhang
Ning Ding
Yuxian Gu
Xiao Liu
...
Jie Tang
Ji-Rong Wen
Jinhui Yuan
Wayne Xin Zhao
Jun Zhu
AIFin MQ AI4MH
384
985
0
14 Jun 2021
Assessing Multilingual Fairness in Pre-trained Multimodal Representations
Findings (Findings), 2021
Jialu Wang
Yang Liu
Xinze Wang
EGVM
233
42
0
12 Jun 2021
Team RUC_AIM3 Technical Report at ActivityNet 2021: Entities Object Localization
Ludan Ruan
Jieting Chen
Yuqing Song
Shizhe Chen
Qin Jin
84
0
0
11 Jun 2021
Chasing Sparsity in Vision Transformers: An End-to-End Exploration
Neural Information Processing Systems (NeurIPS), 2021
Tianlong Chen
Yu Cheng
Zhe Gan
Lu Yuan
Lei Zhang
Zinan Lin
ViT
242
255
0
08 Jun 2021
BERTGEN: Multi-task Generation through BERT
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Faidon Mitzalis
Ozan Caglayan
Pranava Madhyastha
Lucia Specia
VLM
108
7
0
07 Jun 2021
MERLOT: Multimodal Neural Script Knowledge Models
Neural Information Processing Systems (NeurIPS), 2021
Rowan Zellers
Ximing Lu
Jack Hessel
Youngjae Yu
J. S. Park
Jize Cao
Ali Farhadi
Yejin Choi
VLM LRM
348
425
0
04 Jun 2021
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Haiyang Xu
Ming Yan
Chenliang Li
Bin Bi
Songfang Huang
Wenming Xiao
Fei Huang
VLM
310
126
0
03 Jun 2021
GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning
Findings (Findings), 2021
Jiaqi Chen
Jianheng Tang
Jinghui Qin
Xiaodan Liang
Lingbo Liu
Eric Xing
Liang Lin
AIMat
218
251
0
30 May 2021
Modeling Text-visual Mutual Dependency for Multi-modal Dialog Generation
Shuhe Wang
Yuxian Meng
Xiaofei Sun
Leilei Gan
Rongbin Ouyang
Rui Yan
Tianwei Zhang
Jiwei Li
220
15
0
30 May 2021
M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis via Non-Autoregressive Generative Transformers
Zhu Zhang
Jianxin Ma
Chang Zhou
Rui Men
Zhikang Li
Ming Ding
Jie Tang
Jingren Zhou
Hongxia Yang
345
47
0
29 May 2021
Multi-Modal Semantic Inconsistency Detection in Social Media News Posts
Conference on Multimedia Modeling (MMM), 2021
S. McCrae
Kehan Wang
A. Zakhor
142
16
0
26 May 2021
Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking
Findings (Findings), 2021
Heng-Da Xu
Zhongli Li
Qingyu Zhou
Chao Li
Zizhen Wang
Yunbo Cao
Heyan Huang
Xian-Ling Mao
195
109
0
26 May 2021
Understanding Mobile GUI: from Pixel-Words to Screen-Sentences
Jingwen Fu
Xiaoyi Zhang
Yuwang Wang
Wenjun Zeng
Sam Yang
Grayson Hilliard
226
16
0
25 May 2021
Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
IEEE Journal of Biomedical and Health Informatics (JBHI), 2021
Jong Hak Moon
HyunGyung Lee
W. Shin
Young-Hak Kim
Edward Choi
MedIm
221
210
0
24 May 2021
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
Findings (Findings), 2021
Hu Xu
Gargi Ghosh
Po-Yao (Bernie) Huang
Prahal Arora
Masoumeh Aminzadeh
Christoph Feichtenhofer
Florian Metze
Luke Zettlemoyer
327
146
0
20 May 2021
A Review on Explainability in Multimodal Deep Neural Nets
IEEE Access (IEEE Access), 2021
Gargi Joshi
Rahee Walambe
K. Kotecha
373
171
0
17 May 2021
Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval
International Conference on Machine Learning and Applications (ICMLA), 2021
K. Ueki
254
5
0
16 May 2021
Recent Advances in Deep Learning Based Dialogue Systems: A Systematic Survey
Artificial Intelligence Review (AIR), 2021
Jinjie Ni
Tom Young
Vlad Pandelea
Fuzhao Xue
Xiaoshi Zhong
808
320
0
10 May 2021
Playing Lottery Tickets with Vision and Language
AAAI Conference on Artificial Intelligence (AAAI), 2021
Zhe Gan
Yen-Chun Chen
Linjie Li
Tianlong Chen
Yu Cheng
Shuohang Wang
Jingjing Liu
Lijuan Wang
Zicheng Liu
VLM
300
62
0
23 Apr 2021
Detector-Free Weakly Supervised Grounding by Separation
IEEE International Conference on Computer Vision (ICCV), 2021
Assaf Arbelle
Sivan Doveh
Amit Alfassy
J. Shtok
Guy Lev
...
Kate Saenko
S. Ullman
Raja Giryes
Rogerio Feris
Leonid Karlinsky
174
31
0
20 Apr 2021