Human Attention in Visual Question Answering: Do Humans and Deep
Networks Look at the Same Regions?
We conduct large-scale studies on `human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple game-inspired novel attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). We find that depending on the implementation used, machine-generated attention maps are either \emph{negatively correlated} with human attention or have positive correlation worse than task-independent saliency. Overall, our experiments paint a bleak picture for the current generation of attention models in VQA.
View on arXiv