Ask Your Neurons: A Neural-based Approach to Answering Questions about Images
We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining the latest advances in image representation and natural language processing, we propose Neural-Image-QA, an end-to-end formulation of this problem in which all parts are trained jointly. In contrast to previous efforts, we face a multi-modal problem where the language output (answer) is conditioned on both visual input (image) and natural language input (question). Our approach doubles the performance of the previous best result on this problem. We provide additional insights into the problem by analyzing how much information is contained in the language part alone, for which we provide a new human baseline. We also collected further annotations to study human consensus, which relates to the ambiguities inherent in this challenging task.