Gradient-based Counterfactual Explanations using Tractable Probabilistic Models

Counterfactual examples are an appealing class of post-hoc explanations for machine learning models. Given an input $x$ of class $y$, its counterfactual is a contrastive example $x'$ of another class $y'$. Current approaches primarily solve this task by a complex optimization: define an objective function based on the loss of the counterfactual outcome $y'$ with hard or soft constraints, then optimize this function as a black box. This "deep learning" approach, however, is rather slow, sometimes tricky, and may result in unrealistic counterfactual examples. In this work, we propose a novel approach to deal with these problems using only two gradient computations based on tractable probabilistic models. First, we compute an unconstrained counterfactual $u$ of $x$ to induce the counterfactual outcome $y'$. Then, we adapt $u$ to higher density regions, resulting in $x'$. Empirical evidence demonstrates the dominant advantages of our approach.
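The two-step idea can be illustrated with a minimal sketch. It is not the paper's implementation: as assumptions made only for illustration, a logistic-regression classifier stands in for the model being explained, and a single Gaussian stands in for the tractable probabilistic model. The function name `two_step_counterfactual`, the step sizes, and all numeric values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gaussian_log_density(x, mu, cov_inv):
    # Log-density up to an additive constant (enough for comparisons).
    d = x - mu
    return -0.5 * d @ cov_inv @ d

def two_step_counterfactual(x, w, b, mu, cov_inv, target,
                            step_cls=1.0, step_density=0.5):
    """Toy version of the two gradient computations (illustrative only)."""
    # Step 1: one gradient step on the cross-entropy loss toward `target`.
    # For logistic regression, d(loss)/dx = (p - target) * w.
    p = sigmoid(w @ x + b)
    u = x - step_cls * (p - target) * w

    # Step 2: one gradient-ascent step on the log-density of the stand-in
    # density model; for a Gaussian, d(log p)/du = -cov_inv @ (u - mu).
    x_cf = u + step_density * (-cov_inv @ (u - mu))
    return u, x_cf

# Hypothetical toy setup: 2-D input, classifier depends on the first feature.
w, b = np.array([2.0, 0.0]), 0.0
x = np.array([-1.0, 0.5])      # classified as class 0
mu = np.array([1.5, 0.0])      # assumed mean of the target-class density
cov_inv = np.eye(2)            # assumed identity precision matrix

u, x_cf = two_step_counterfactual(x, w, b, mu, cov_inv, target=1.0)
print("unconstrained counterfactual u:", u)
print("density-adapted counterfactual:", x_cf)
```

In this toy setting the first step moves the input across the decision boundary, and the second pulls the result toward the mode of the assumed density. With other step sizes or models, a single step is not guaranteed to do either, so both conditions should be checked in practice.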