1908.07490
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
20 August 2019
Hao Tan
Mohit Bansal
VLM
MLLM
Papers citing
"LXMERT: Learning Cross-Modality Encoder Representations from Transformers"
50 / 1,506 papers shown
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
Zhicheng Huang
Zhaoyang Zeng
Yupan Huang
Bei Liu
Dongmei Fu
Jianlong Fu
VLM
ViT
34
271
0
07 Apr 2021
Compressing Visual-linguistic Model via Knowledge Distillation
Zhiyuan Fang
Jianfeng Wang
Xiaowei Hu
Lijuan Wang
Yezhou Yang
Zicheng Liu
VLM
31
96
0
05 Apr 2021
VisQA: X-raying Vision and Language Reasoning in Transformers
Theo Jaunet
Corentin Kervadec
Romain Vuillemot
G. Antipov
M. Baccouche
Christian Wolf
8
26
0
02 Apr 2021
Towards General Purpose Vision Systems
Tanmay Gupta
Amita Kamath
Aniruddha Kembhavi
Derek Hoiem
11
49
0
01 Apr 2021
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
Mingyang Zhou
Luowei Zhou
Shuohang Wang
Yu Cheng
Linjie Li
Zhou Yu
Jingjing Liu
MLLM
VLM
23
89
0
01 Apr 2021
A Survey on Natural Language Video Localization
Xinfang Liu
Xiushan Nie
Zhifang Tan
Jie Guo
Yilong Yin
20
7
0
01 Apr 2021
Zero-Shot Language Transfer vs Iterative Back Translation for Unsupervised Machine Translation
Aviral Joshi
Chengzhi Huang
H. Singh
19
2
0
31 Mar 2021
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
Or Patashnik
Zongze Wu
Eli Shechtman
Daniel Cohen-Or
Dani Lischinski
CLIP
VLM
17
1,190
0
31 Mar 2021
Diagnosing Vision-and-Language Navigation: What Really Matters
Wanrong Zhu
Yuankai Qi
P. Narayana
Kazoo Sone
Sugato Basu
X. Wang
Qi Wu
M. Eckstein
W. Wang
LM&Ro
22
50
0
30 Mar 2021
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers
Antoine Miech
Jean-Baptiste Alayrac
Ivan Laptev
Josef Sivic
Andrew Zisserman
ViT
20
136
0
30 Mar 2021
Self-supervised Image-text Pre-training With Mixed Data In Chest X-rays
Xiaosong Wang
Ziyue Xu
Leo K. Tam
Dong Yang
Daguang Xu
ViT
MedIm
17
23
0
30 Mar 2021
Domain-robust VQA with diverse datasets and methods but no target labels
Mingda Zhang
Tristan D. Maidment
Ahmad Diab
Adriana Kovashka
R. Hwa
OOD
11
22
0
29 Mar 2021
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers
Hila Chefer
Shir Gur
Lior Wolf
ViT
9
302
0
29 Mar 2021
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
Pengchuan Zhang
Xiyang Dai
Jianwei Yang
Bin Xiao
Lu Yuan
Lei Zhang
Jianfeng Gao
ViT
23
328
0
29 Mar 2021
HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
Song Liu
Haoqi Fan
Shengsheng Qian
Yiru Chen
Wenkui Ding
Zhongyuan Wang
22
145
0
28 Mar 2021
'Just because you are right, doesn't mean I am wrong': Overcoming a Bottleneck in the Development and Evaluation of Open-Ended Visual Question Answering (VQA) Tasks
Man Luo
Shailaja Keyur Sampat
Riley Tallman
Yankai Zeng
Manuha Vancha
Akarshan Sajja
Chitta Baral
14
10
0
28 Mar 2021
Understanding Robustness of Transformers for Image Classification
Srinadh Bhojanapalli
Ayan Chakrabarti
Daniel Glasner
Daliang Li
Thomas Unterthiner
Andreas Veit
ViT
14
377
0
26 Mar 2021
Visual Grounding Strategies for Text-Only Natural Language Processing
Damien Sileo
21
8
0
25 Mar 2021
VLGrammar: Grounded Grammar Induction of Vision and Language
Yining Hong
Qing Li
Song-Chun Zhu
Siyuan Huang
VLM
13
25
0
24 Mar 2021
Scene-Intuitive Agent for Remote Embodied Visual Grounding
Xiangru Lin
Guanbin Li
Yizhou Yu
LM&Ro
22
52
0
24 Mar 2021
Variable Name Recovery in Decompiled Binary Code using Constrained Masked Language Modeling
Pratyay Banerjee
Kuntal Kumar Pal
Fish Wang
Chitta Baral
17
12
0
23 Mar 2021
Multi-Modal Answer Validation for Knowledge-Based VQA
Jialin Wu
Jiasen Lu
Ashish Sabharwal
Roozbeh Mottaghi
6
139
0
23 Mar 2021
Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval
Gregor Geigle
Jonas Pfeiffer
Nils Reimers
Ivan Vulić
Iryna Gurevych
27
59
0
22 Mar 2021
How to Design Sample and Computationally Efficient VQA Models
Karan Samel
Zelin Zhao
Binghong Chen
Kuan-Chieh Jackson Wang
Haozheng Luo
Le Song
13
4
0
22 Mar 2021
An Unsupervised Sampling Approach for Image-Sentence Matching Using Document-Level Structural Information
Zejun Li
Zhongyu Wei
Zhihao Fan
Haijun Shan
Xuanjing Huang
14
5
0
21 Mar 2021
Variational Knowledge Distillation for Disease Classification in Chest X-Rays
Tom van Sonsbeek
Xiantong Zhen
M. Worring
Ling Shao
11
13
0
19 Mar 2021
Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning
Mandela Patrick
Yuki M. Asano
Bernie Huang
Ishan Misra
Florian Metze
Joao Henriques
Andrea Vedaldi
AI4TS
16
33
0
18 Mar 2021
Automatic Generation of Contrast Sets from Scene Graphs: Probing the Compositional Consistency of GQA
Yonatan Bitton
Gabriel Stanovsky
Roy Schwartz
Michael Elhadad
CoGe
17
33
0
17 Mar 2021
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval
Siqi Sun
Yen-Chun Chen
Linjie Li
Shuohang Wang
Yuwei Fang
Jingjing Liu
VLM
30
82
0
16 Mar 2021
SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels
Chenliang Li
Ming Yan
Haiyang Xu
Fuli Luo
Wei Wang
Bin Bi
Songfang Huang
VLM
26
36
0
14 Mar 2021
What is Multimodality?
Letitia Parcalabescu
Nils Trost
Anette Frank
16
0
0
10 Mar 2021
Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision
Andrew Shin
Masato Ishii
T. Narihira
33
36
0
06 Mar 2021
Causal Attention for Vision-Language Tasks
Xu Yang
Hanwang Zhang
Guojun Qi
Jianfei Cai
CML
23
148
0
05 Mar 2021
M6: A Chinese Multimodal Pretrainer
Junyang Lin
Rui Men
An Yang
Chan Zhou
Ming Ding
...
Yong Li
Wei Lin
Jingren Zhou
J. Tang
Hongxia Yang
VLM
MoE
29
132
0
01 Mar 2021
Detecting Harmful Content On Online Platforms: What Platforms Need Vs. Where Research Efforts Go
Arnav Arora
Preslav Nakov
Momchil Hardalov
Sheikh Muhammad Sarwar
Vibha Nayak
...
Dimitrina Zlatkova
Kyle Dent
Ameya Bhatawdekar
Guillaume Bouchard
Isabelle Augenstein
17
46
0
27 Feb 2021
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIP
VLM
98
27,569
0
26 Feb 2021
UniT: Multimodal Multitask Learning with a Unified Transformer
Ronghang Hu
Amanpreet Singh
ViT
14
295
0
22 Feb 2021
Learning Compositional Representation for Few-shot Visual Question Answering
Dalu Guo
Dacheng Tao
OOD
CoGe
17
4
0
21 Feb 2021
VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning
Jun Chen
Han Guo
Kai Yi
Boyang Albert Li
Mohamed Elhoseiny
VLM
26
218
0
20 Feb 2021
Hierarchical Similarity Learning for Language-based Product Image Retrieval
Zhe Ma
Fenghao Liu
Jianfeng Dong
Xiaoye Qu
Yuan He
S. Ji
VLM
16
4
0
18 Feb 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
273
1,081
0
17 Feb 2021
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Jie Lei
Linjie Li
Luowei Zhou
Zhe Gan
Tamara L. Berg
Mohit Bansal
Jingjing Liu
CLIP
32
645
0
11 Feb 2021
Biomedical Question Answering: A Survey of Approaches and Challenges
Qiao Jin
Zheng Yuan
Guangzhi Xiong
Qian Yu
Huaiyuan Ying
Chuanqi Tan
Mosha Chen
Songfang Huang
Xiaozhong Liu
Sheng Yu
21
95
0
10 Feb 2021
Fashion Focus: Multi-modal Retrieval System for Video Commodity Localization in E-commerce
Yanhao Zhang
Qiang Wang
Pan Pan
Yun Zheng
Cheng Da
Siyang Sun
Yinghui Xu
45
9
0
09 Feb 2021
Iconographic Image Captioning for Artworks
E. Cetinic
13
24
0
07 Feb 2021
CSS-LM: A Contrastive Framework for Semi-supervised Fine-tuning of Pre-trained Language Models
Yusheng Su
Xu Han
Yankai Lin
Zhengyan Zhang
Zhiyuan Liu
Peng Li
Jie Zhou
Maosong Sun
11
10
0
07 Feb 2021
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Wonjae Kim
Bokyung Son
Ildoo Kim
VLM
CLIP
39
1,703
0
05 Feb 2021
RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER
Lin Sun
Jiquan Wang
Kai Zhang
Yindu Su
Fangsheng Weng
14
132
0
05 Feb 2021
Unifying Vision-and-Language Tasks via Text Generation
Jaemin Cho
Jie Lei
Hao Tan
Mohit Bansal
MLLM
249
525
0
04 Feb 2021
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Lisa Anne Hendricks
John F. J. Mellor
R. Schneider
Jean-Baptiste Alayrac
Aida Nematzadeh
75
110
0
31 Jan 2021