1908.05054
Fusion of Detected Objects in Text for Visual Question Answering
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
14 August 2019
Chris Alberti
Jeffrey Ling
Michael Collins
David Reitter
Github (1675★)
Papers citing "Fusion of Detected Objects in Text for Visual Question Answering"
50 / 109 papers shown
GroundSight: Augmenting Vision-Language Models with Grounding Information and De-hallucination
Xinxi Chen
Tianyang Chen
Lijia Hong
HILM
20
0
0
30 Sep 2025
Memory-Augmented Multimodal LLMs for Surgical VQA via Self-Contained Inquiry
Wenjun Hou
Yi Cheng
Kaishuai Xu
Yan Hu
Wenjie Li
Jiang-Dong Liu
169
4
0
17 Nov 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Chenyu Yang
Xizhou Zhu
Jinguo Zhu
Weijie Su
Junjie Wang
...
Lewei Lu
Bin Li
Jie Zhou
Yu Qiao
Jifeng Dai
VLM
CLIP
159
8
0
11 Jun 2024
EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning
Mingjie Ma
Zhihuan Yu
Yichao Ma
Guohui Li
LRM
158
2
0
22 Apr 2024
FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues
Shuang Li
Jiahua Wang
Lijie Wen
LRM
115
0
0
29 Mar 2024
Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
Maurits J. R. Bleeker
Mariya Hendriksen
Andrew Yates
Maarten de Rijke
VLM
255
9
0
27 Feb 2024
VD-GR: Boosting Visual Dialog with Cascaded Spatial-Temporal Multi-Modal GRaphs
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Adnen Abdessaied
Lei Shi
Andreas Bulling
3DH
122
6
0
25 Oct 2023
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Yangyang Guo
Fangkai Jiao
Zhiqi Shen
Liqiang Nie
Mohan S. Kankanhalli
MLLM
292
12
0
17 Oct 2023
ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
Yangyang Guo
Haoyu Zhang
Yongkang Wong
Liqiang Nie
Mohan Kankanhalli
VLM
160
5
0
28 Sep 2023
Separate and Locate: Rethink the Text in Text-based Visual Question Answering
ACM Multimedia (ACM MM), 2023
Chengyang Fang
Jiangnan Li
Liang Li
Can Ma
Dayong Hu
223
16
0
31 Aug 2023
MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning
Jianghui Wang
Yuxuan Wang
Dongyan Zhao
Zilong Zheng
272
1
0
04 Jun 2023
Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
131
1
0
31 May 2023
Deeply Coupled Cross-Modal Prompt Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Xuejing Liu
Wei Tang
Jinghui Lu
Rui Zhao
Zhaojun Guo
Fei Tan
VLM
177
21
0
29 May 2023
ArK: Augmented Reality with Knowledge Interactive Emergent Ability
Qiuyuan Huang
Jinho Park
Abhinav Gupta
Paul N. Bennett
Ran Gong
...
Baolin Peng
O. Mohammed
C. Pal
Yejin Choi
Jianfeng Gao
160
7
0
01 May 2023
Enhancing object detection robustness: A synthetic and natural perturbation approach
N. Premakumara
B. Jalaeian
N. Suri
H. Samani
130
4
0
20 Apr 2023
Probabilistic Prompt Learning for Dense Prediction
Computer Vision and Pattern Recognition (CVPR), 2023
Hyeongjun Kwon
Taeyong Song
Somi Jeong
Jin-Hwa Kim
Jinhyun Jang
Kwanghoon Sohn
VLM
222
25
0
03 Apr 2023
Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Chunpu Xu
Jing Li
VLM
92
5
0
27 Mar 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Machine Intelligence Research (MIR), 2023
Tianlin Li
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CE
VLM
384
259
0
20 Feb 2023
Multi-modal Machine Learning in Engineering Design: A Review and Future Directions
Journal of Computing and Information Science in Engineering (JCISE), 2023
Binyang Song
Ruilin Zhou
Faez Ahmed
AI4CE
284
61
0
14 Feb 2023
A survey on knowledge-enhanced multimodal learning
Artificial Intelligence Review (Artif Intell Rev), 2022
Maria Lymperaiou
Giorgos Stamou
409
19
0
19 Nov 2022
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
Computer Vision and Pattern Recognition (CVPR), 2022
Weijie Su
Xizhou Zhu
Chenxin Tao
Lewei Lu
Bin Li
Gao Huang
Yu Qiao
Xiaogang Wang
Jie Zhou
Jifeng Dai
181
54
0
17 Nov 2022
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
ACM Transactions on Knowledge Discovery from Data (TKDD), 2021
Fenglin Liu
Xian Wu
Shen Ge
Xuancheng Ren
Wei Fan
Xu Sun
Yuexian Zou
VLM
175
13
0
28 Oct 2022
Masked Vision-Language Transformer in Fashion
Machine Intelligence Research (MIR), 2022
Ge-Peng Ji
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Daniel Gehrig
Luc Van Gool
218
27
0
27 Oct 2022
Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Tong Wang
Jorma T. Laaksonen
T. Langer
Heikki Arponen
Tom E. Bishop
VLM
116
6
0
24 Oct 2022
Contrastive Language-Image Pre-Training with Knowledge Graphs
Neural Information Processing Systems (NeurIPS), 2022
Xuran Pan
Tianzhu Ye
Dongchen Han
Qing Xiao
Gao Huang
VLM
CLIP
175
62
0
17 Oct 2022
Learning to Evaluate Performance of Multi-modal Semantic Localization
IEEE Transactions on Geoscience and Remote Sensing (IEEE TGRS), 2022
Zhiqiang Yuan
Wenkai Zhang
Chongyang Li
Zhaoying Pan
Yongqiang Mao
Jialiang Chen
Shuoke Li
Hongqi Wang
Xian Sun
202
28
0
14 Sep 2022
Computational Sarcasm Analysis on Social Media: A Systematic Review
Faria Binte Kader
Nafisa Hossain Nujat
Tasmia Binte Sogir
Mohsinul Kabir
H. Mahmud
Md. Kamrul Hasan
156
7
0
13 Sep 2022
PreSTU: Pre-Training for Scene-Text Understanding
IEEE International Conference on Computer Vision (ICCV), 2022
Jihyung Kil
Soravit Changpinyo
Xi Chen
Hexiang Hu
Sebastian Goodman
Wei-Lun Chao
Radu Soricut
VLM
277
36
0
12 Sep 2022
A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch
European Conference on Computer Vision (ECCV), 2022
Patsorn Sangkloy
Wittawat Jitkrittum
Diyi Yang
James Hays
3DV
156
40
0
05 Aug 2022
Vision-and-Language Pretraining
Thong Nguyen
Cong-Duy Nguyen
Xiaobao Wu
See-Kiong Ng
Anh Tuan Luu
VLM
CLIP
244
2
0
05 Jul 2022
Multimodal Learning with Transformers: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Peng Xu
Xiatian Zhu
David Clifton
ViT
447
800
0
13 Jun 2022
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yuan Yao
Qi-An Chen
Ao Zhang
Wei Ji
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
VLM
MLLM
173
43
0
23 May 2022
Learning to Answer Visual Questions from Web Videos
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
226
38
0
10 May 2022
Training and challenging models for text-guided fashion image retrieval
Eric Dodds
Jack Culpepper
Gaurav Srivastava
124
10
0
23 Apr 2022
Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks
IEEE Transactions on Image Processing (IEEE TIP), 2022
Gen Luo
Weihao Ye
Xiaoshuai Sun
Yan Wang
Liujuan Cao
Yongjian Wu
Feiyue Huang
Rongrong Ji
ViT
134
57
0
16 Apr 2022
Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Xiwen Liang
Fengda Zhu
Lingling Li
Hang Xu
Xiaodan Liang
LM&Ro
VLM
108
32
0
08 Mar 2022
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
Feng Li
Hao Zhang
Yi-Fan Zhang
Shixuan Liu
Jian Guo
L. Ni
Pengchuan Zhang
Lei Zhang
AI4TS
VLM
157
40
0
03 Mar 2022
VLP: A Survey on Vision-Language Pre-training
Machine Intelligence Research (MIR), 2022
Feilong Chen
Duzhen Zhang
Minglun Han
Xiuyi Chen
Jing Shi
Shuang Xu
Bo Xu
VLM
311
279
0
18 Feb 2022
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Computer Vision and Pattern Recognition (CVPR), 2022
Rowan Zellers
Jiasen Lu
Ximing Lu
Youngjae Yu
Yanpeng Zhao
Mohammadreza Salehi
Aditya Kusupati
Jack Hessel
Ali Farhadi
Yejin Choi
400
236
0
07 Jan 2022
LaTr: Layout-Aware Transformer for Scene-Text VQA
Computer Vision and Pattern Recognition (CVPR), 2021
Ali Furkan Biten
Ron Litman
Yusheng Xie
Srikar Appalaraju
R. Manmatha
ViT
319
113
0
23 Dec 2021
Decompose the Sounds and Pixels, Recompose the Events
AAAI Conference on Artificial Intelligence (AAAI), 2021
Varshanth R. Rao
Md Ibrahim Khalil
Haoda Li
Peng Dai
Juwei Lu
121
5
0
21 Dec 2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Jiaming Song
Xiaohua Wang
Jifeng Dai
232
149
0
02 Dec 2021
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Faisal Ahmed
Zicheng Liu
Yumao Lu
Lijuan Wang
293
131
0
23 Nov 2021
LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
Mohammad Abuzar Shaikh
Zhanghexuan Ji
Dana Moukheiber
Yan Shen
S. Srihari
Mingchen Gao
VLM
132
1
0
04 Sep 2021
Audio-Visual Transformer Based Crowd Counting
Usman Sajid
Xiangyu Chen
Hasan Sajid
Taejoon Kim
Guanghui Wang
ViT
218
24
0
04 Sep 2021
Auto-Parsing Network for Image Captioning and Visual Question Answering
IEEE International Conference on Computer Vision (ICCV), 2021
Xu Yang
Chongyang Gao
Hanwang Zhang
Jianfei Cai
197
41
0
24 Aug 2021
From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network
IEEE International Conference on Computer Vision (ICCV), 2021
Yuxin Wang
Hongtao Xie
Shancheng Fang
Jing Wang
Shenggao Zhu
Yongdong Zhang
VLM
197
172
0
22 Aug 2021
Airbert: In-domain Pretraining for Vision-and-Language Navigation
Pierre-Louis Guhur
Makarand Tapaswi
Shizhe Chen
Ivan Laptev
Cordelia Schmid
LM&Ro
153
163
0
20 Aug 2021
Knowledge Perceived Multi-modal Pretraining in E-commerce
Yushan Zhu
Huaixiao Tou
Wen Zhang
Ganqiang Ye
Hui Chen
Ningyu Zhang
Huajun Chen
185
37
0
20 Aug 2021
Exceeding the Limits of Visual-Linguistic Multi-Task Learning
Cameron R. Wolfe
Keld T. Lundgaard
VLM
124
3
0
27 Jul 2021