Studying Small Language Models with Susceptibilities

25 April 2025

Abstract

We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small, controlled perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. Building a set of perturbations (probes) yields a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer. Susceptibilities link local learning coefficients from singular learning theory with linear-response theory, and quantify how local loss landscape geometry deforms under shifts in the data distribution.

View on arXiv

@article{baker2025_2504.18274,
  title={ Studying Small Language Models with Susceptibilities },
  author={ Garrett Baker and George Wang and Jesse Hoogland and Daniel Murfet },
  journal={arXiv preprint arXiv:2504.18274},
  year={ 2025 }
}

Comments on this paper