615

Efficient Bilinear Attention-based Fusion for Medical Visual Question Answering

Main:6 Pages
1 Figures
Bibliography:1 Pages
1 Tables
Abstract

Medical Visual Question Answering (MedVQA) has attracted growing interest at the intersection of computer vision and natural language processing. By interpreting medical images and providing precise answers to relevant clinical inquiries, MedVQA has the potential to support diagnostic decision-making and reduce workload across various domains, particularly radiology. While recent approaches rely heavily on unified large pre-trained Visual-Language Models, research on more efficient fusion mechanisms remains relatively limited in this domain. In this paper, we introduce a novel fusion model, OMniBAN, that integrates Orthogonality loss, Multi-head attention, and a Bilinear Attention Network to achieve high computational efficiency alongside solid performance. We conduct comprehensive experiments and provide insights into how bilinear attention fusion can approximate the performance of larger fusion models like cross-modal Transformer. Our results demonstrate that OMniBAN outperforms traditional approaches on key MedVQA benchmarks while maintaining a lower computational cost. This balance between efficiency and accuracy suggests that OMniBAN could be a viable option for real-world medical image question answering, where computational resources are often constrained.

View on arXiv
Comments on this paper