Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.07490
Cited By
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
20 August 2019
Hao Hao Tan
Mohit Bansal
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"LXMERT: Learning Cross-Modality Encoder Representations from Transformers"
50 / 1,506 papers shown
Title
A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports
Yikuan Li
Hanyin Wang
Yuan Luo
6
62
0
03 Sep 2020
Active Contrastive Learning of Audio-Visual Video Representations
Shuang Ma
Zhaoyang Zeng
Daniel J. McDuff
Yale Song
VLM
SSL
9
8
0
31 Aug 2020
Visual Question Answering on Image Sets
Ankan Bansal
Yuting Zhang
Rama Chellappa
CoGe
8
40
0
27 Aug 2020
Linguistically-aware Attention for Reducing the Semantic-Gap in Vision-Language Tasks
K. Gouthaman
Athira M. Nambiar
K. Srinivas
Anurag Mittal
VLM
12
12
0
18 Aug 2020
DeVLBert: Learning Deconfounded Visio-Linguistic Representations
Shengyu Zhang
Tan Jiang
Tan Wang
Kun Kuang
Zhou Zhao
Jianke Zhu
Jin Yu
Hongxia Yang
Fei Wu
OOD
9
85
0
16 Aug 2020
Weakly supervised cross-domain alignment with optimal transport
Siyang Yuan
Ke Bai
Liqun Chen
Yizhe Zhang
Chenyang Tao
Chunyuan Li
Guoyin Wang
Ricardo Henao
Lawrence Carin
OT
16
7
0
14 Aug 2020
A Machine of Few Words -- Interactive Speaker Recognition with Reinforcement Learning
Mathieu Seurin
Florian Strub
Philippe Preux
Olivier Pietquin
6
5
0
07 Aug 2020
Learning Visual Representations with Caption Annotations
Mert Bulent Sariyildiz
J. Perez
Diane Larlus
VLM
SSL
6
158
0
04 Aug 2020
Describing Textures using Natural Language
Chenyun Wu
Mikayla Timm
Subhransu Maji
3DV
12
10
0
03 Aug 2020
SeqDialN: Sequential Visual Dialog Networks in Joint Visual-Linguistic Representation Space
Liu Yang
VLM
16
5
0
02 Aug 2020
Pre-training for Video Captioning Challenge 2020 Summary
Yingwei Pan
Jun Xu
Yehao Li
Ting Yao
Tao Mei
8
1
0
27 Jul 2020
REXUP: I REason, I EXtract, I UPdate with Structured Compositional Reasoning for Visual Question Answering
Siwen Luo
S. Han
Kaiyuan Sun
Josiah Poon
CoGe
LRM
ReLM
11
4
0
27 Jul 2020
Contrastive Visual-Linguistic Pretraining
Lei Shi
Kai Shuang
Shijie Geng
Peng Su
Zhengkai Jiang
Peng Gao
Zuohui Fu
Gerard de Melo
Sen Su
VLM
SSL
CLIP
18
29
0
26 Jul 2020
Spatially Aware Multimodal Transformers for TextVQA
Yash Kant
Dhruv Batra
Peter Anderson
A. Schwing
Devi Parikh
Jiasen Lu
Harsh Agrawal
6
85
0
23 Jul 2020
A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation
Yongjing Yin
Fandong Meng
Jinsong Su
Chulun Zhou
Zhengyuan Yang
Jie Zhou
Jiebo Luo
14
138
0
17 Jul 2020
Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder
K. Gouthaman
Anurag Mittal
36
78
0
13 Jul 2020
IQ-VQA: Intelligent Visual Question Answering
Vatsal Goel
Mohit Chandak
A. Anand
Prithwijit Guha
17
5
0
08 Jul 2020
Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training
Yingwei Pan
Yehao Li
Jianjie Luo
Jun Xu
Ting Yao
Tao Mei
16
57
0
05 Jul 2020
Latent Compositional Representations Improve Systematic Generalization in Grounded Question Answering
Ben Bogin
Sanjay Subramanian
Matt Gardner
Jonathan Berant
ReLM
OOD
BDL
LRM
6
28
0
01 Jul 2020
Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation
Wanrong Zhu
X. Wang
Tsu-jui Fu
An Yan
P. Narayana
Kazoo Sone
Sugato Basu
W. Wang
21
33
0
01 Jul 2020
Modality-Agnostic Attention Fusion for visual search with text feedback
Eric Dodds
Jack Culpepper
Simão Herdade
Yang Zhang
K. Boakye
EgoV
6
71
0
30 Jun 2020
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph
Fei Yu
Jiji Tang
Weichong Yin
Yu Sun
Hao Tian
Hua-Hong Wu
Haifeng Wang
4
375
0
30 Jun 2020
Improving VQA and its Explanations \\ by Comparing Competing Explanations
Jialin Wu
Liyan Chen
Raymond J. Mooney
FAtt
AAML
22
17
0
28 Jun 2020
Unsupervised Video Decomposition using Spatio-temporal Iterative Inference
Polina Zablotskaia
E. Dominici
Leonid Sigal
Andreas M. Lehrmann
OCL
12
19
0
25 Jun 2020
Neuro-Symbolic Visual Reasoning: Disentangling "Visual" from "Reasoning"
Saeed Amizadeh
Hamid Palangi
Oleksandr Polozov
Yichen Huang
K. Koishida
NAI
LRM
20
58
0
20 Jun 2020
Overcoming Statistical Shortcuts for Open-ended Visual Counting
Corentin Dancette
Rémi Cadène
Xinlei Chen
Matthieu Cord
6
3
0
17 Jun 2020
Contrastive Learning for Weakly Supervised Phrase Grounding
Tanmay Gupta
Arash Vahdat
Gal Chechik
Xiaodong Yang
Jan Kautz
Derek Hoiem
ObjD
SSL
27
139
0
17 Jun 2020
VirTex: Learning Visual Representations from Textual Annotations
Karan Desai
Justin Johnson
SSL
VLM
11
432
0
11 Jun 2020
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Zhe Gan
Yen-Chun Chen
Linjie Li
Chen Zhu
Yu Cheng
Jingjing Liu
ObjD
VLM
24
485
0
11 Jun 2020
Roses Are Red, Violets Are Blue... but Should Vqa Expect Them To?
Corentin Kervadec
G. Antipov
M. Baccouche
Christian Wolf
OOD
10
86
0
09 Jun 2020
Translating Natural Language Instructions for Behavioral Robot Navigation with a Multi-Head Attention Mechanism
Patricio Cerda-Mardini
Vladimir Araujo
Alvaro Soto
11
5
0
01 Jun 2020
Adaptive Transformers for Learning Multimodal Representations
Prajjwal Bhargava
6
4
0
15 May 2020
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
Jize Cao
Zhe Gan
Yu Cheng
Licheng Yu
Yen-Chun Chen
Jingjing Liu
VLM
14
127
0
15 May 2020
Cross-Modality Relevance for Reasoning on Language and Vision
Chen Zheng
Quan Guo
Parisa Kordjamshidi
LRM
21
36
0
12 May 2020
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning
Jie Lei
Liwei Wang
Yelong Shen
Dong Yu
Tamara L. Berg
Mohit Bansal
14
186
0
11 May 2020
History for Visual Dialog: Do we really need it?
Shubham Agarwal
Trung Bui
Joon-Young Lee
Ioannis Konstas
Verena Rieser
VLM
11
69
0
08 May 2020
Cross-media Structured Common Space for Multimedia Event Extraction
Manling Li
Alireza Zareian
Qi Zeng
Spencer Whitehead
Di Lu
Heng Ji
Shih-Fu Chang
6
102
0
05 May 2020
Visual Question Answering with Prior Class Semantics
Violetta Shevchenko
Damien Teney
A. Dick
A. Hengel
BDL
8
7
0
04 May 2020
Visually Grounded Continual Learning of Compositional Phrases
Xisen Jin
Junyi Du
Arka Sadhu
Ram Nevatia
Xiang Ren
CLL
12
4
0
02 May 2020
Obtaining Faithful Interpretations from Compositional Neural Networks
Sanjay Subramanian
Ben Bogin
Nitish Gupta
Tomer Wolfson
Sameer Singh
Jonathan Berant
Matt Gardner
6
42
0
02 May 2020
Probing Contextual Language Models for Common Ground with Visual Representations
Gabriel Ilharco
Rowan Zellers
Ali Farhadi
Hannaneh Hajishirzi
6
14
0
01 May 2020
Visuo-Linguistic Question Answering (VLQA) Challenge
Shailaja Keyur Sampat
Yezhou Yang
Chitta Baral
CoGe
11
1
0
01 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM
VLM
OffRL
AI4TS
29
491
0
01 May 2020
Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO
Zarana Parekh
Jason Baldridge
Daniel Matthew Cer
Austin Waters
Yinfei Yang
6
61
0
30 Apr 2020
Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
Arjun Majumdar
Ayush Shrivastava
Stefan Lee
Peter Anderson
Devi Parikh
Dhruv Batra
LM&Ro
29
230
0
30 Apr 2020
Span-based Localizing Network for Natural Language Video Localization
Hao Zhang
Aixin Sun
Wei Jing
Joey Tianyi Zhou
4
311
0
29 Apr 2020
VD-BERT: A Unified Vision and Dialog Transformer with BERT
Yue Wang
Shafiq R. Joty
Michael R. Lyu
Irwin King
Caiming Xiong
S. Hoi
19
102
0
28 Apr 2020
A Novel Attention-based Aggregation Function to Combine Vision and Language
Matteo Stefanini
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
VLM
9
9
0
27 Apr 2020
Deep Multimodal Neural Architecture Search
Zhou Yu
Yuhao Cui
Jun-chen Yu
Meng Wang
Dacheng Tao
Qi Tian
8
98
0
25 Apr 2020
VisualCOMET: Reasoning about the Dynamic Context of a Still Image
J. S. Park
Chandra Bhagavatula
Roozbeh Mottaghi
Ali Farhadi
Yejin Choi
ReLM
LRM
11
6
0
22 Apr 2020
Previous
1
2
3
...
28
29
30
31
Next