ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Neural Information Processing Systems (NeurIPS), 2019

6 August 2019

Devi Parikh

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,232 papers shown

Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models

North American Chapter of the Association for Computational Linguistics (NAACL), 2021

Po-Yao (Bernie) Huang

Mandela Patrick

Junjie Hu

Graham Neubig

Florian Metze

Alexander G. Hauptmann

MLLM VLM

323

16 Mar 2021

LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

North American Chapter of the Association for Computational Linguistics (NAACL), 2021

199

16 Mar 2021

SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels

151

14 Mar 2021

A Survey on Multimodal Disinformation Detection

International Conference on Computational Linguistics (COLING), 2021

Firoj Alam

S. Cresci

Tanmoy Chakraborty

Fabrizio Silvestri

Dimiter Dimitrov

Giovanni Da San Martino

Shaden Shaar

Hamed Firooz

Preslav Nakov

257

116

13 Mar 2021

What is Multimodality?

Letitia Parcalabescu

Nils Trost

Anette Frank

230

10 Mar 2021

Pretrained Transformers as Universal Computation Engines

Kevin Lu

Aditya Grover

Pieter Abbeel

Igor Mordatch

299

230

09 Mar 2021

Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision

International Journal of Computer Vision (IJCV), 2021

Andrew Shin

Masato Ishii

T. Narihira

289

06 Mar 2021

Causal Attention for Vision-Language Tasks

Computer Vision and Pattern Recognition (CVPR), 2021

Jianfei Cai

228

193

05 Mar 2021

WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2021

529

390

02 Mar 2021

M6: A Chinese Multimodal Pretrainer

Rui Men

...

Yong Li

Jialin Li

Jingren Zhou

J. Tang

Hongxia Yang

VLM MoE

345

147

01 Mar 2021

Learning Transferable Visual Models From Natural Language Supervision

International Conference on Machine Learning (ICML), 2021

...

2.0K

41,575

26 Feb 2021

UniT: Multimodal Multitask Learning with a Unified Transformer

IEEE International Conference on Computer Vision (ICCV), 2021

Ronghang Hu

Amanpreet Singh

ViT

361

343

22 Feb 2021

Learning Compositional Representation for Few-shot Visual Question Answering

Dalu Guo

Dacheng Tao

OOD CoGe

153

21 Feb 2021

VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Computer Vision and Pattern Recognition (CVPR), 2021

454

274

20 Feb 2021

Hierarchical Similarity Learning for Language-based Product Image Retrieval

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021

154

18 Feb 2021

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

Computer Vision and Pattern Recognition (CVPR), 2021

1.1K

1,360

17 Feb 2021

LambdaNetworks: Modeling Long-Range Interactions Without Attention

International Conference on Learning Representations (ICLR), 2021

Irwan Bello

509

187

17 Feb 2021

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

Computer Vision and Pattern Recognition (CVPR), 2021

458

748

11 Feb 2021

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

International Conference on Machine Learning (ICML), 2021

1.3K

4,893

11 Feb 2021

Telling the What while Pointing to the Where: Multimodal Queries for Image Retrieval

IEEE International Conference on Computer Vision (ICCV), 2021

197

09 Feb 2021

Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021

131

09 Feb 2021

Iconographic Image Captioning for Artworks

E. Cetinic

157

07 Feb 2021

CSS-LM: A Contrastive Framework for Semi-supervised Fine-tuning of Pre-trained Language Models

IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2021

Yusheng Su

Xu Han

Yankai Lin

Zhengyan Zhang

Zhiyuan Liu

Peng Li

Jie Zhou

Maosong Sun

176

07 Feb 2021

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

International Conference on Machine Learning (ICML), 2021

547

2,107

05 Feb 2021

RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER

AAAI Conference on Artificial Intelligence (AAAI), 2021

159

172

05 Feb 2021

Unifying Vision-and-Language Tasks via Text Generation

International Conference on Machine Learning (ICML), 2021

599

609

04 Feb 2021

Inferring spatial relations from textual descriptions of images

Pattern Recognition (Pattern Recogn.), 2021

A. Elu

Gorka Azkune

Oier López de Lacalle

Ignacio Arganda-Carreras

Aitor Soroa Etxabe

Eneko Agirre

139

01 Feb 2021

Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

Transactions of the Association for Computational Linguistics (TACL), 2021

Lisa Anne Hendricks

John F. J. Mellor

R. Schneider

Jean-Baptiste Alayrac

Aida Nematzadeh

238

126

31 Jan 2021

An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games

Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021

142

31 Jan 2021

VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs

Computer Vision and Pattern Recognition (CVPR), 2021

Gedas Bertasius

Devi Parikh

249

28 Jan 2021

Bottleneck Transformers for Visual Recognition

Computer Vision and Pattern Recognition (CVPR), 2021

Pieter Abbeel

703

1,124

27 Jan 2021

Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network

AAAI Conference on Artificial Intelligence (AAAI), 2021

Yingwei Pan

Tao Mei

157

27 Jan 2021

Cross-lingual Visual Pre-training for Multimodal Machine Translation

Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021

Pranava Madhyastha

199

25 Jan 2021

Adversarial Text-to-Image Synthesis: A Review

Neural Networks (NN), 2021

322

201

25 Jan 2021

Visual Question Answering based on Local-Scene-Aware Referring Expression Generation

Neural Networks (NN), 2021

Jialin Wu

185

22 Jan 2021

SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation

Computer Vision and Pattern Recognition (CVPR), 2021

239

188

21 Jan 2021

Learning rich touch representations through cross-modal self-supervision

Conference on Robot Learning (CoRL), 2021

199

21 Jan 2021

Understanding in Artificial Intelligence

188

17 Jan 2021

Latent Variable Models for Visual Question Answering

Zixu Wang

Yishu Miao

Lucia Specia

237

16 Jan 2021

Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge

213

15 Jan 2021

Probabilistic Embeddings for Cross-Modal Retrieval

Computer Vision and Pattern Recognition (CVPR), 2021

Sanghyuk Chun

Seong Joon Oh

Rafael Sampaio de Rezende

Yannis Kalantidis

Diane Larlus

UQCV

909

261

13 Jan 2021

Trear: Transformer-based RGB-D Egocentric Action Recognition

IEEE Transactions on Cognitive and Developmental Systems (IEEE TCDS), 2021

389

05 Jan 2021

Transformers in Vision: A Survey

ACM Computing Surveys (CSUR), 2021

Salman Khan

924

3,176

04 Jan 2021

VinVL: Revisiting Visual Representations in Vision-Language Models

Pengchuan Zhang

Xiujun Li

Xiaowei Hu

Jianwei Yang

Lei Zhang

Lijuan Wang

Yejin Choi

Jianfeng Gao

ObjD VLM

513

168

02 Jan 2021

KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation

Annual Meeting of the Association for Computational Linguistics (ACL), 2021

Roger Wattenhofer

281

02 Jan 2021

CDLM: Cross-Document Language Modeling

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

Arman Cohan

239

02 Jan 2021

UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

Annual Meeting of the Association for Computational Linguistics (ACL), 2020

797

406

31 Dec 2020

Accurate Word Representations with Universal Visual Guidance

182

30 Dec 2020

OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts

Rui Yan

Jiwei Li

371

30 Dec 2020

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Annual Meeting of the Association for Computational Linguistics (ACL), 2020

...

Min Zhang

846

610

29 Dec 2020