ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Neural Information Processing Systems (NeurIPS), 2019

6 August 2019

Devi Parikh

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,232 papers shown

Efficient Multi-Modal Embeddings from Structured Data

A. Vero

Ann A. Copestake

118

06 Oct 2021

Word Acquisition in Neural Language Models

Tyler A. Chang

Benjamin Bergen

268

05 Oct 2021

A Survey On Neural Word Embeddings

Erhan Sezerer

Selma Tekir

AI4TS

271

05 Oct 2021

ProTo: Program-Guided Transformer for Program-Guided Tasks

260

02 Oct 2021

Visually Grounded Concept Composition

192

29 Sep 2021

Visually Grounded Reasoning across Languages and Cultures

Siva Reddy

483

202

28 Sep 2021

Audio-to-Image Cross-Modal Generation

IEEE International Joint Conference on Neural Network (IJCNN), 2021

Maciej Żelaszczyk

Jacek Mańdziuk

DiffM

202

27 Sep 2021

VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual Question Answering

Conference on Computational Natural Language Learning (CoNLL), 2021

172

27 Sep 2021

Why Do We Click: Visual Impression-aware News Recommendation

ACM Multimedia (ACM MM), 2021

Jiahao Xun

Shengyu Zhang

Zhou Zhao

Jieming Zhu

238

26 Sep 2021

Systematic Generalization on gSCAN: What is Nearly Solved and What is Next?

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

172

25 Sep 2021

MLIM: Vision-and-Language Model Pre-training with Masked Language and Image Modeling

121

24 Sep 2021

CLIPort: What and Where Pathways for Robotic Manipulation

Conference on Robot Learning (CoRL), 2021

344

819

24 Sep 2021

Detecting Harmful Memes and Their Targets

Findings (Findings), 2021

Dimitar Dimitrov

182

151

24 Sep 2021

CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models

Yuan Yao

Ao Zhang

Zhengyan Zhang

Zhiyuan Liu

Tat-Seng Chua

Maosong Sun

MLLM VPVLM VLM

594

244

24 Sep 2021

Dense Contrastive Visual-Linguistic Pretraining

ACM Multimedia (ACM MM), 2021

240

24 Sep 2021

Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it?

BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackBoxNLP), 2021

Tobias Norlund

Lovisa Hagström

Richard Johansson

275

23 Sep 2021

Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark

ACM Multimedia (ACM MM), 2021

142

23 Sep 2021

Cross-Modal Coherence for Text-to-Image Retrieval

AAAI Conference on Artificial Intelligence (AAAI), 2021

Vladimir Pavlovic

179

22 Sep 2021

Caption Enriched Samples for Improving Hateful Memes Detection

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

Efrat Blaier

Itzik Malkiel

Lior Wolf

VLM

166

22 Sep 2021

COVR: A test-bed for Visually Grounded Compositional Generalization with real images

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

180

22 Sep 2021

KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation

283

22 Sep 2021

Survey: Transformer based Video-Language Pre-training

Ludan Ruan

Qin Jin

VLM ViT

210

21 Sep 2021

ActionCLIP: A New Paradigm for Video Action Recognition

Mengmeng Wang

Jiazheng Xing

Yong Liu

VLM

415

467

17 Sep 2021

An End-to-End Transformer Model for 3D Object Detection

436

574

16 Sep 2021

A Survey on Temporal Sentence Grounding in Videos

321

16 Sep 2021

Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering

Ander Salaberria

Gorka Azkune

Oier López de Lacalle

Aitor Soroa Etxabe

Eneko Agirre

301

15 Sep 2021

What Vision-Language Models `See' when they See Scenes

264

15 Sep 2021

Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning

300

14 Sep 2021

Discovering the Unknown Knowns: Turning Implicit Knowledge in the Dataset into Explicit Training Examples for Visual Question Answering

Jihyung Kil

Cheng Zhang

D. Xuan

Wei-Lun Chao

264

13 Sep 2021

xGQA: Cross-Lingual Visual Question Answering

362

13 Sep 2021

TEASEL: A Transformer-Based Speech-Prefixed Language Model

Mehdi Arjmand

M. Dousti

H. Moradi

147

12 Sep 2021

COSMic: A Coherence-Aware Generation Metric for Image Descriptions

156

11 Sep 2021

A Survey on Multi-modal Summarization

206

11 Sep 2021

MOMENTA: A Multimodal Framework for Detecting Harmful Memes and Their Targets

Dimitar Dimitrov

224

169

11 Sep 2021

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

AAAI Conference on Artificial Intelligence (AAAI), 2021

Zicheng Liu

611

489

10 Sep 2021

Panoptic Narrative Grounding

IEEE International Conference on Computer Vision (ICCV), 2021

258

10 Sep 2021

We went to look for meaning and all we got were these lousy representations: aspects of meaning representation for computational semantics

138

10 Sep 2021

Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

H. Khan

D. Gupta

Asif Ekbal

166

10 Sep 2021

Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

Stella Frank

Emanuele Bugliarello

Desmond Elliott

184

09 Sep 2021

TxT: Crossmodal End-to-End Learning with Transformers

German Conference on Pattern Recognition (DAGM), 2021

130

09 Sep 2021

M5Product: Self-harmonized Contrastive Learning for E-commercial Multi-modal Pretraining

Computer Vision and Pattern Recognition (CVPR), 2021

Michael C. Kampffmeyer

Xiaoyong Wei

Minlong Lu

Yaowei Wang

Xiaodan Liang

589

09 Sep 2021

Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense in Text Generation Models

AAAI Conference on Artificial Intelligence (AAAI), 2021

226

08 Sep 2021

Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

217

08 Sep 2021

Learning grounded word meaning representations on similarity graphs

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

Mariella Dimiccoli

H. Wendt

Pau Batlle

158

07 Sep 2021

CTRL-C: Camera calibration TRansformer with Line-Classification

IEEE International Conference on Computer Vision (ICCV), 2021

208

06 Sep 2021

Learning to Generate Scene Graph from Natural Language Supervision

Yiwu Zhong

Jing Shi

Jianwei Yang

Chenliang Xu

Yin Li

SSL

263

06 Sep 2021

Data Efficient Masked Language Modeling for Vision and Language

Gabriel Stanovsky

235

05 Sep 2021

LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation

Mohammad Abuzar Shaikh

171

04 Sep 2021

Weakly Supervised Relative Spatial Reasoning for Visual Question Answering

Yezhou Yang

163

04 Sep 2021

Supervised Contrastive Learning for Multimodal Unreliable News Detection in COVID-19 Pandemic

Wenjia Zhang

Lin Gui

Yulan He

141

04 Sep 2021