Fusion of Detected Objects in Text for Visual Question Answering
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019
14 August 2019
Chris Alberti
Jeffrey Ling
Michael Collins
David Reitter

Papers citing "Fusion of Detected Objects in Text for Visual Question Answering"

50 / 109 papers shown
GroundSight: Augmenting Vision-Language Models with Grounding Information and De-hallucination
Xinxi Chen
Tianyang Chen
Lijia Hong
HILM
20
0
0
30 Sep 2025
Memory-Augmented Multimodal LLMs for Surgical VQA via Self-Contained Inquiry
Wenjun Hou
Yi Cheng
Kaishuai Xu
Yan Hu
Wenjie Li
Jiang-Dong Liu
169
4
0
17 Nov 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Chenyu Yang
Xizhou Zhu
Jinguo Zhu
Weijie Su
Junjie Wang
...
Lewei Lu
Bin Li
Jie Zhou
Yu Qiao
Jifeng Dai
VLM
CLIP
159
8
0
11 Jun 2024
EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning
Mingjie Ma
Zhihuan Yu
Yichao Ma
Guohui Li
LRM
158
2
0
22 Apr 2024
FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues
Shuang Li
Jiahua Wang
Lijie Wen
LRM
115
0
0
29 Mar 2024
Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
Maurits J. R. Bleeker
Mariya Hendriksen
Andrew Yates
Maarten de Rijke
VLM
255
9
0
27 Feb 2024
$\mathbb{VD}$-$\mathbb{GR}$: Boosting $\mathbb{V}$isual $\mathbb{D}$ialog with Cascaded Spatial-Temporal Multi-Modal $\mathbb{GR}$aphs
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Adnen Abdessaied
Lei Shi
Andreas Bulling
3DH
122
6
0
25 Oct 2023
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Yangyang Guo
Fangkai Jiao
Zhiqi Shen
Liqiang Nie
Mohan S. Kankanhalli
MLLM
292
12
0
17 Oct 2023
ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens
Yangyang Guo
Haoyu Zhang
Yongkang Wong
Liqiang Nie
Mohan Kankanhalli
VLM
160
5
0
28 Sep 2023
Separate and Locate: Rethink the Text in Text-based Visual Question Answering
ACM Multimedia (ACM MM), 2023
Chengyang Fang
Jiangnan Li
Liang Li
Can Ma
Dayong Hu
223
16
0
31 Aug 2023
MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning
Jianghui Wang
Yuxuan Wang
Dongyan Zhao
Zilong Zheng
272
1
0
04 Jun 2023
Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
131
1
0
31 May 2023
Deeply Coupled Cross-Modal Prompt Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Xuejing Liu
Wei Tang
Jinghui Lu
Rui Zhao
Zhaojun Guo
Fei Tan
VLM
177
21
0
29 May 2023
ArK: Augmented Reality with Knowledge Interactive Emergent Ability
Qiuyuan Huang
Jinho Park
Abhinav Gupta
Paul N. Bennett
Ran Gong
...
Baolin Peng
O. Mohammed
C. Pal
Yejin Choi
Jianfeng Gao
160
7
0
01 May 2023
Enhancing object detection robustness: A synthetic and natural perturbation approach
N. Premakumara
B. Jalaeian
N. Suri
H. Samani
130
4
0
20 Apr 2023
Probabilistic Prompt Learning for Dense Prediction
Computer Vision and Pattern Recognition (CVPR), 2023
Hyeongjun Kwon
Taeyong Song
Somi Jeong
Jin-Hwa Kim
Jinhyun Jang
Kwanghoon Sohn
VLM
222
25
0
03 Apr 2023
Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Chunpu Xu
Jing Li
VLM
92
5
0
27 Mar 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Machine Intelligence Research (MIR), 2023
Tianlin Li
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CE
VLM
384
259
0
20 Feb 2023
Multi-modal Machine Learning in Engineering Design: A Review and Future Directions
Journal of Computing and Information Science in Engineering (JCISE), 2023
Binyang Song
Ruilin Zhou
Faez Ahmed
AI4CE
284
61
0
14 Feb 2023
A survey on knowledge-enhanced multimodal learning
Artificial Intelligence Review (Artif Intell Rev), 2022
Maria Lymperaiou
Giorgos Stamou
409
19
0
19 Nov 2022
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
Computer Vision and Pattern Recognition (CVPR), 2022
Weijie Su
Xizhou Zhu
Chenxin Tao
Lewei Lu
Bin Li
Gao Huang
Yu Qiao
Xiaogang Wang
Jie Zhou
Jifeng Dai
181
54
0
17 Nov 2022
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
ACM Transactions on Knowledge Discovery from Data (TKDD), 2021
Fenglin Liu
Xian Wu
Shen Ge
Xuancheng Ren
Wei Fan
Xu Sun
Yuexian Zou
VLM
175
13
0
28 Oct 2022
Masked Vision-Language Transformer in Fashion
Machine Intelligence Research (MIR), 2022
Ge-Peng Ji
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Daniel Gehrig
Luc Van Gool
218
27
0
27 Oct 2022
Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Tong Wang
Jorma T. Laaksonen
T. Langer
Heikki Arponen
Tom E. Bishop
VLM
116
6
0
24 Oct 2022
Contrastive Language-Image Pre-Training with Knowledge Graphs
Neural Information Processing Systems (NeurIPS), 2022
Xuran Pan
Tianzhu Ye
Dongchen Han
Qing Xiao
Gao Huang
VLM
CLIP
175
62
0
17 Oct 2022
Learning to Evaluate Performance of Multi-modal Semantic Localization
IEEE Transactions on Geoscience and Remote Sensing (IEEE TGRS), 2022
Zhiqiang Yuan
Wenkai Zhang
Chongyang Li
Zhaoying Pan
Yongqiang Mao
Jialiang Chen
Shuoke Li
Hongqi Wang
Xian Sun
202
28
0
14 Sep 2022
Computational Sarcasm Analysis on Social Media: A Systematic Review
Faria Binte Kader
Nafisa Hossain Nujat
Tasmia Binte Sogir
Mohsinul Kabir
H. Mahmud
Md. Kamrul Hasan
156
7
0
13 Sep 2022
PreSTU: Pre-Training for Scene-Text Understanding
IEEE International Conference on Computer Vision (ICCV), 2022
Jihyung Kil
Soravit Changpinyo
Xi Chen
Hexiang Hu
Sebastian Goodman
Wei-Lun Chao
Radu Soricut
VLM
277
36
0
12 Sep 2022
A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch
European Conference on Computer Vision (ECCV), 2022
Patsorn Sangkloy
Wittawat Jitkrittum
Diyi Yang
James Hays
3DV
156
40
0
05 Aug 2022
Vision-and-Language Pretraining
Thong Nguyen
Cong-Duy Nguyen
Xiaobao Wu
See-Kiong Ng
Anh Tuan Luu
VLM
CLIP
244
2
0
05 Jul 2022
Multimodal Learning with Transformers: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Peng Xu
Xiatian Zhu
David Clifton
ViT
447
800
0
13 Jun 2022
PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yuan Yao
Qi-An Chen
Ao Zhang
Wei Ji
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
VLM
MLLM
173
43
0
23 May 2022
Learning to Answer Visual Questions from Web Videos
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
226
38
0
10 May 2022
Training and challenging models for text-guided fashion image retrieval
Eric Dodds
Jack Culpepper
Gaurav Srivastava
124
10
0
23 Apr 2022
Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks
IEEE Transactions on Image Processing (IEEE TIP), 2022
Gen Luo
Weihao Ye
Xiaoshuai Sun
Yan Wang
Liujuan Cao
Yongjian Wu
Feiyue Huang
Rongrong Ji
ViT
134
57
0
16 Apr 2022
Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Xiwen Liang
Fengda Zhu
Lingling Li
Hang Xu
Xiaodan Liang
LM&Ro
VLM
108
32
0
08 Mar 2022
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
Feng Li
Hao Zhang
Yi-Fan Zhang
Shixuan Liu
Jian Guo
L. Ni
Pengchuan Zhang
Lei Zhang
AI4TS
VLM
157
40
0
03 Mar 2022
VLP: A Survey on Vision-Language Pre-training
Machine Intelligence Research (MIR), 2022
Feilong Chen
Duzhen Zhang
Minglun Han
Xiuyi Chen
Jing Shi
Shuang Xu
Bo Xu
VLM
311
279
0
18 Feb 2022
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Computer Vision and Pattern Recognition (CVPR), 2022
Rowan Zellers
Jiasen Lu
Ximing Lu
Youngjae Yu
Yanpeng Zhao
Mohammadreza Salehi
Aditya Kusupati
Jack Hessel
Ali Farhadi
Yejin Choi
400
236
0
07 Jan 2022
LaTr: Layout-Aware Transformer for Scene-Text VQA
Computer Vision and Pattern Recognition (CVPR), 2021
Ali Furkan Biten
Ron Litman
Yusheng Xie
Srikar Appalaraju
R. Manmatha
ViT
319
113
0
23 Dec 2021
Decompose the Sounds and Pixels, Recompose the Events
AAAI Conference on Artificial Intelligence (AAAI), 2021
Varshanth R. Rao
Md Ibrahim Khalil
Haoda Li
Peng Dai
Juwei Lu
121
5
0
21 Dec 2021
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Xizhou Zhu
Jinguo Zhu
Hao Li
Xiaoshi Wu
Xiaogang Wang
Jiaming Song
Xiaohua Wang
Jifeng Dai
232
149
0
02 Dec 2021
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
Zhengyuan Yang
Zhe Gan
Jianfeng Wang
Xiaowei Hu
Faisal Ahmed
Zicheng Liu
Yumao Lu
Lijuan Wang
293
131
0
23 Nov 2021
LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
Mohammad Abuzar Shaikh
Zhanghexuan Ji
Dana Moukheiber
Yan Shen
S. Srihari
Mingchen Gao
VLM
132
1
0
04 Sep 2021
Audio-Visual Transformer Based Crowd Counting
Usman Sajid
Xiangyu Chen
Hasan Sajid
Taejoon Kim
Guanghui Wang
ViT
218
24
0
04 Sep 2021
Auto-Parsing Network for Image Captioning and Visual Question Answering
IEEE International Conference on Computer Vision (ICCV), 2021
Xu Yang
Chongyang Gao
Hanwang Zhang
Jianfei Cai
197
41
0
24 Aug 2021
From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network
IEEE International Conference on Computer Vision (ICCV), 2021
Yuxin Wang
Hongtao Xie
Shancheng Fang
Jing Wang
Shenggao Zhu
Yongdong Zhang
VLM
197
172
0
22 Aug 2021
Airbert: In-domain Pretraining for Vision-and-Language Navigation
Pierre-Louis Guhur
Makarand Tapaswi
Shizhe Chen
Ivan Laptev
Cordelia Schmid
LM&Ro
153
163
0
20 Aug 2021
Knowledge Perceived Multi-modal Pretraining in E-commerce
Yushan Zhu
Huaixiao Tou
Wen Zhang
Ganqiang Ye
Hui Chen
Ningyu Zhang
Huajun Chen
185
37
0
20 Aug 2021
Exceeding the Limits of Visual-Linguistic Multi-Task Learning
Cameron R. Wolfe
Keld T. Lundgaard
VLM
124
3
0
27 Jul 2021