
Modular Training of Neural Networks aids Interpretability

Abstract

One approach to improving neural network interpretability is clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We define a measure of clusterability and show, via spectral graph clustering, that pre-trained models form highly enmeshed clusters. We therefore train models to be more modular using a "clusterability loss" function that encourages the formation of non-interacting clusters. Using automated interpretability techniques, we show that our method can help train models that are more modular and that learn different, disjoint, and smaller circuits. We investigate CNNs trained on MNIST and CIFAR, small transformers trained on modular addition, and language models. Our approach provides a promising direction for training neural networks that learn simpler functions and are easier to interpret.
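To make the idea concrete, here is a minimal sketch of one possible "clusterability loss" in the spirit the abstract describes: given a fixed assignment of units to clusters, penalize the magnitude of weights connecting units in different clusters, so that gradient descent drives cross-cluster interactions toward zero. The function name, the L1 penalty, and the fixed cluster labels are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical illustration: penalize cross-cluster weight magnitudes so that
# clusters become non-interacting. The cluster labels are fixed here for
# simplicity; the paper's method may assign or update them differently.
import numpy as np

def clusterability_loss(W, out_labels, in_labels):
    """Sum of |W[i, j]| over all unit pairs assigned to different clusters.

    W:          (n_out, n_in) weight matrix of one layer.
    out_labels: (n_out,) cluster label of each output unit.
    in_labels:  (n_in,)  cluster label of each input unit.
    """
    # mask[i, j] is True when output unit i and input unit j differ in cluster
    mask = out_labels[:, None] != in_labels[None, :]
    return np.abs(W[mask]).sum()

# Toy 2x2 layer with two clusters: only the off-diagonal (cross-cluster)
# weights 0.5 and 0.2 are penalized.
W = np.array([[1.0, 0.5],
              [0.2, 2.0]])
loss = clusterability_loss(W, np.array([0, 1]), np.array([0, 1]))
print(loss)  # 0.7
```

Adding a term like this to the training objective (weighted by a coefficient) would trade task performance against modularity, analogous to how an L1 penalty trades performance against sparsity.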

@article{golechha2025_2502.02470,
  title={Modular Training of Neural Networks aids Interpretability},
  author={Satvik Golechha and Maheep Chaudhary and Joan Velja and Alessandro Abate and Nandi Schoots},
  journal={arXiv preprint arXiv:2502.02470},
  year={2025}
}