Communities
Connect sessions
AI calendar
Organizations
Join Slack
Contact Sales
Search
Open menu
Home
Papers
1908.05054
Cited By
v1
v2 (latest)
Fusion of Detected Objects in Text for Visual Question Answering
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
14 August 2019
Chris Alberti
Jeffrey Ling
Michael Collins
David Reitter
Re-assign community
ArXiv (abs)
PDF
HTML
Github (1675★)
Papers citing
"Fusion of Detected Objects in Text for Visual Question Answering"
50 / 109 papers shown
Multi-stage Pre-training over Simplified Multimodal Pre-training Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Tongtong Liu
Fangxiang Feng
Caixia Yuan
76
16
0
22 Jul 2021
Productivity, Portability, Performance: Data-Centric Python
Yiheng Wang
Yao Zhang
Yanzhang Wang
Yan Wan
Jiao Wang
Zhongyuan Wu
Yuhao Yang
Bowen She
415
112
0
01 Jul 2021
Pre-Trained Models: Past, Present and Future
AI Open (AO), 2021
Xu Han
Zhengyan Zhang
Ning Ding
Yuxian Gu
Xiao Liu
...
Jie Tang
Ji-Rong Wen
Jinhui Yuan
Wayne Xin Zhao
Jun Zhu
AIFin
MQ
AI4MH
390
995
0
14 Jun 2021
MERLOT: Multimodal Neural Script Knowledge Models
Neural Information Processing Systems (NeurIPS), 2021
Rowan Zellers
Ximing Lu
Jack Hessel
Youngjae Yu
J. S. Park
Jize Cao
Ali Farhadi
Yejin Choi
VLM
LRM
358
430
0
04 Jun 2021
PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World
Annual Meeting of the Association for Computational Linguistics (ACL), 2021
Rowan Zellers
Ari Holtzman
Matthew E. Peters
Roozbeh Mottaghi
Aniruddha Kembhavi
Ali Farhadi
Yejin Choi
316
77
0
01 Jun 2021
Modeling Text-visual Mutual Dependency for Multi-modal Dialog Generation
Shuhe Wang
Yuxian Meng
Xiaofei Sun
Leilei Gan
Rongbin Ouyang
Rui Yan
Tianwei Zhang
Jiwei Li
234
15
0
30 May 2021
A Review on Explainability in Multimodal Deep Neural Nets
IEEE Access (IEEE Access), 2021
Gargi Joshi
Rahee Walambe
K. Kotecha
401
171
0
17 May 2021
Recent Advances in Deep Learning Based Dialogue Systems: A Systematic Survey
Artificial Intelligence Review (AIR), 2021
Jinjie Ni
Tom Young
Vlad Pandelea
Fuzhao Xue
Xiaoshi Zhong
870
323
0
10 May 2021
Detector-Free Weakly Supervised Grounding by Separation
IEEE International Conference on Computer Vision (ICCV), 2021
Assaf Arbelle
Sivan Doveh
Amit Alfassy
J. Shtok
Guy Lev
...
Kate Saenko
S. Ullman
Raja Giryes
Rogerio Feris
Leonid Karlinsky
196
31
0
20 Apr 2021
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
Computer Vision and Pattern Recognition (CVPR), 2021
Zhicheng Huang
Zhaoyang Zeng
Yupan Huang
Bei Liu
Dongmei Fu
Jianlong Fu
VLM
ViT
457
302
0
07 Apr 2021
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
Computer Vision and Pattern Recognition (CVPR), 2021
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Linbo Jin
Ben Chen
Hao Zhou
Minghui Qiu
Ling Shao
VLM
350
134
0
30 Mar 2021
Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision
International Journal of Computer Vision (IJCV), 2021
Andrew Shin
Masato Ishii
T. Narihira
310
51
0
06 Mar 2021
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2021
Krishna Srinivasan
K. Raman
Jiecao Chen
Michael Bendersky
Marc Najork
VLM
533
391
0
02 Mar 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Computer Vision and Pattern Recognition (CVPR), 2021
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
1.2K
1,370
0
17 Feb 2021
Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge
Violetta Shevchenko
Damien Teney
A. Dick
Anton Van Den Hengel
215
31
0
15 Jan 2021
Transformers in Vision: A Survey
ACM Computing Surveys (CSUR), 2021
Salman Khan
Muzammal Naseer
Munawar Hayat
Syed Waqas Zamir
Fahad Shahbaz Khan
M. Shah
ViT
943
3,202
0
04 Jan 2021
OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts
Yuxian Meng
Shuhe Wang
Qinghong Han
Xiaofei Sun
Leilei Gan
Rui Yan
Jiwei Li
374
31
0
30 Dec 2020
ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces
AAAI Conference on Artificial Intelligence (AAAI), 2020
Zecheng He
Srinivas Sunkara
Xiaoxue Zang
Ying Xu
Lijuan Liu
Nevan Wichers
Gabriel Schubiner
Ruby B. Lee
Jindong Chen
Blaise Agüera y Arcas
265
90
0
22 Dec 2020
KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA
Computer Vision and Pattern Recognition (CVPR), 2020
Kenneth Marino
Xinlei Chen
Devi Parikh
Abhinav Gupta
Marcus Rohrbach
272
228
0
20 Dec 2020
A Closer Look at the Robustness of Vision-and-Language Pre-trained Models
Linjie Li
Zhe Gan
Jingjing Liu
VLM
265
50
0
15 Dec 2020
KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning
Knowledge-Based Systems (KBS), 2020
Dandan Song
S. Ma
Zhanchen Sun
Sicheng Yang
L. Liao
SSL
LRM
262
42
0
13 Dec 2020
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
Zhengyuan Yang
Yijuan Lu
Jianfeng Wang
Xi Yin
D. Florêncio
Lijuan Wang
Cha Zhang
Lei Zhang
Jiebo Luo
VLM
266
159
0
08 Dec 2020
Parameter Efficient Multimodal Transformers for Video Representation Learning
Sangho Lee
Youngjae Yu
Gunhee Kim
Thomas Breuel
Jan Kautz
Yale Song
ViT
281
88
0
08 Dec 2020
Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation
Jeff Da
Maxwell Forbes
Rowan Zellers
Anthony Zheng
Jena D. Hwang
Antoine Bosselut
Yejin Choi
DiffM
215
5
0
08 Dec 2020
Classification of Multimodal Hate Speech -- The Winning Solution of Hateful Memes Challenge
Xiayu Zhong
155
16
0
02 Dec 2020
Improving Calibration in Deep Metric Learning With Cross-Example Softmax
Andreas Veit
Kimberly Wilber
72
3
0
17 Nov 2020
Human-centric Spatio-Temporal Video Grounding With Visual Transformers
Zongheng Tang
Yue Liao
Si Liu
Guanbin Li
Xiaojie Jin
Hongxu Jiang
Qian Yu
Dong Xu
228
128
0
10 Nov 2020
MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering
Findings (Findings), 2020
Aisha Urooj Khan
Amir Mazaheri
N. Lobo
M. Shah
216
62
0
27 Oct 2020
Multimodal Research in Vision and Language: A Review of Current and Emerging Trends
Shagun Uppal
Sarthak Bhagat
Devamanyu Hazarika
Navonil Majumdar
Soujanya Poria
Roger Zimmermann
Amir Zadeh
282
6
0
19 Oct 2020
Unsupervised Natural Language Inference via Decoupled Multimodal Contrastive Learning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Wanyun Cui
Guangyu Zheng
Wei Wang
SSL
140
21
0
16 Oct 2020
Beyond Language: Learning Commonsense from Images for Reasoning
Findings (Findings), 2020
Wanqing Cui
Yanyan Lan
Liang Pang
Jiafeng Guo
Xueqi Cheng
LRM
146
5
0
10 Oct 2020
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Jaemin Cho
Jiasen Lu
Dustin Schwenk
Hannaneh Hajishirzi
Aniruddha Kembhavi
VLM
MLLM
231
106
0
23 Sep 2020
A Multimodal Memes Classification: A Survey and Open Research Issues
Tariq Habib Afridi
A. Alam
Muhammad Numan Khan
Jawad Khan
Young-Koo Lee
214
43
0
17 Sep 2020
Modality-Agnostic Attention Fusion for visual search with text feedback
Eric Dodds
Jack Culpepper
Simão Herdade
Yang Zhang
K. Boakye
EgoV
265
88
0
30 Jun 2020
Video-Grounded Dialogues with Pretrained Generation Language Models
Hung Le
Guosheng Lin
226
31
0
27 Jun 2020
Contrastive Learning for Weakly Supervised Phrase Grounding
Tanmay Gupta
Arash Vahdat
Gal Chechik
Xiaodong Yang
Jan Kautz
Derek Hoiem
ObjD
SSL
350
157
0
17 Jun 2020
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Neural Information Processing Systems (NeurIPS), 2020
Zhe Gan
Yen-Chun Chen
Linjie Li
Chen Zhu
Yu Cheng
Jingjing Liu
ObjD
VLM
373
537
0
11 Jun 2020
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
Jize Cao
Zhe Gan
Yu Cheng
Licheng Yu
Yen-Chun Chen
Jingjing Liu
VLM
277
139
0
15 May 2020
Visuo-Linguistic Question Answering (VLQA) Challenge
Shailaja Keyur Sampat
Yezhou Yang
Chitta Baral
CoGe
138
1
0
01 May 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Linjie Li
Yen-Chun Chen
Yu Cheng
Zhe Gan
Licheng Yu
Jingjing Liu
MLLM
VLM
OffRL
AI4TS
713
539
0
01 May 2020
VD-BERT: A Unified Vision and Dialog Transformer with BERT
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Yue Wang
Shafiq Joty
Michael R. Lyu
Irwin King
Caiming Xiong
Guosheng Lin
387
107
0
28 Apr 2020
Are we pretraining it right? Digging deeper into visio-linguistic pretraining
Amanpreet Singh
Vedanuj Goswami
Devi Parikh
VLM
152
48
0
19 Apr 2020
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
Zhicheng Huang
Zhaoyang Zeng
Bei Liu
Dongmei Fu
Jianlong Fu
ViT
420
469
0
02 Apr 2020
Pre-trained Models for Natural Language Processing: A Survey
Science China Technological Sciences (Sci China Technol Sci), 2020
Xipeng Qiu
Tianxiang Sun
Yige Xu
Yunfan Shao
Ning Dai
Xuanjing Huang
LM&MA
VLM
1.2K
1,625
0
18 Mar 2020
XGPT: Cross-modal Generative Pre-Training for Image Captioning
Natural Language Processing and Chinese Computing (NLPCC), 2020
Qiaolin Xia
Haoyang Huang
Nan Duan
Dongdong Zhang
Lei Ji
Zhifang Sui
Edward Cui
Taroon Bharti
Xin Liu
Ming Zhou
MLLM
VLM
248
84
0
03 Mar 2020
What BERT Sees: Cross-Modal Transfer for Visual Question Generation
Thomas Scialom
Patrick Bordes
Paul-Alexis Dray
Jacopo Staiano
Patrick Gallinari
255
7
0
25 Feb 2020
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
Di Qi
Lin Su
Jianwei Song
Edward Cui
Taroon Bharti
Arun Sacheti
VLM
389
277
0
22 Jan 2020
DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog
AAAI Conference on Artificial Intelligence (AAAI), 2019
Feilong Chen
Fandong Meng
Jiaming Xu
Peng Li
Bo Xu
Jie Zhou
189
34
0
18 Dec 2019
12-in-1: Multi-Task Vision and Language Representation Learning
Computer Vision and Pattern Recognition (CVPR), 2019
Jiasen Lu
Vedanuj Goswami
Marcus Rohrbach
Devi Parikh
Stefan Lee
VLM
ObjD
338
499
0
05 Dec 2019
Learning to Learn Words from Visual Scenes
Dídac Surís
Dave Epstein
Heng Ji
Shih-Fu Chang
Carl Vondrick
VLM
CLIP
SSL
OffRL
186
4
0
25 Nov 2019
Previous
1
2
3
Next
Page 2 of 3