Causal Debiasing for Visual Commonsense Reasoning

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025

23 October 2025


Main: 4 pages, 3 figures; bibliography: 1 page

Abstract

Visual Commonsense Reasoning (VCR) refers to answering questions and providing explanations based on images. While existing methods achieve high prediction accuracy, they often overlook bias in datasets and lack debiasing strategies. In this paper, our analysis reveals co-occurrence and statistical biases in both textual and visual data. We introduce the VCR-OOD datasets, comprising VCR-OOD-QA and VCR-OOD-VA subsets, which are designed to evaluate the generalization capabilities of models across two modalities. Furthermore, we analyze the causal graphs and prediction shortcuts in VCR and adopt a backdoor adjustment method to remove bias. Specifically, we create a dictionary based on the set of correct answers to eliminate prediction shortcuts. Experiments demonstrate the effectiveness of our debiasing method across different datasets.
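
The backdoor adjustment mentioned in the abstract replaces the observational quantity P(Y | X) with P(Y | do(X)) = sum over z of P(Y | X, z) P(z), where Z is a confounder. The following is a minimal sketch of that formula on toy count data, not the authors' implementation: the function name `backdoor_adjust`, the `(x, z, y)` sample layout, and the discrete confounder `z` are all assumptions made for illustration. How the paper's answer dictionary maps onto Z is not specified in the abstract.

```python
from collections import defaultdict


def backdoor_adjust(samples, x, y):
    """Estimate P(Y=y | do(X=x)) = sum_z P(Y=y | X=x, Z=z) * P(Z=z).

    `samples` is a list of (x, z, y) observations with a discrete
    confounder z. Hypothetical helper for illustration only.
    """
    n = len(samples)
    z_counts = defaultdict(int)    # occurrences of Z=z
    xz_counts = defaultdict(int)   # occurrences of X=x within stratum Z=z
    xzy_counts = defaultdict(int)  # occurrences of X=x and Y=y within Z=z
    for xi, zi, yi in samples:
        z_counts[zi] += 1
        if xi == x:
            xz_counts[zi] += 1
            if yi == y:
                xzy_counts[zi] += 1

    total = 0.0
    for z, cz in z_counts.items():
        if xz_counts[z] == 0:
            # Assumption: strata where X=x was never observed are skipped.
            continue
        total += (xzy_counts[z] / xz_counts[z]) * (cz / n)
    return total


# Toy data: z influences both how often x=1 occurs and how often y=1 occurs.
samples = [
    (1, 0, 1), (1, 0, 0), (0, 0, 0), (0, 0, 0),
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (0, 1, 1),
]

adjusted = backdoor_adjust(samples, x=1, y=1)
naive = sum(1 for xi, _, yi in samples if xi == 1 and yi == 1) / sum(
    1 for xi, _, _ in samples if xi == 1
)
print(adjusted, naive)
```

On this toy data the adjusted estimate is 0.75 while the naive conditional P(Y=1 | X=1) is 0.8. The gap arises because each stratum of z is weighted by its marginal frequency P(z) rather than by its frequency among the x=1 samples.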
