v1v2v3 (latest)

Towards Unified Attribution in Explainable AI, Data-Centric AI, and Mechanistic Interpretability

31 January 2025

ArXiv (abs)PDF HTML HuggingFace (1 upvotes)

Main:8 Pages

1 Figures

Bibliography:6 Pages

3 Tables

Appendix:11 Pages

Abstract

The increasing complexity of AI systems has made understanding their behavior critical. Numerous interpretability methods have been developed to attribute model behavior to three key aspects: input features, training data, and internal model components, which emerged from explainable AI, data-centric AI, and mechanistic interpretability, respectively. However, these attribution methods are studied and applied rather independently, resulting in a fragmented landscape of methods and terminology. This position paper argues that feature, data, and component attribution methods share fundamental similarities, and a unified view of them benefits both interpretability and broader AI research. To this end, we first analyze popular methods for these three types of attributions and present a unified view demonstrating that these seemingly distinct methods employ similar techniques (such as perturbations, gradients, and linear approximations) over different aspects and thus differ primarily in their perspectives rather than techniques. Then, we demonstrate how this unified view enhances understanding of existing attribution methods, highlights shared concepts and evaluation criteria among these methods, and leads to new research directions both in interpretability research, by addressing common challenges and facilitating cross-attribution innovation, and in AI more broadly, with applications in model editing, steering, and regulation.

View on arXiv

Comments on this paper