Studying Small Language Models with Susceptibilities

We develop a linear-response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small, controlled perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. Assembling a set of such perturbations (probes) yields a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer. Susceptibilities link the local learning coefficients of singular learning theory with linear-response theory, and quantify how local loss-landscape geometry deforms under shifts in the data distribution.
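As a concrete illustration (not taken verbatim from the paper), a first-order response of this kind admits the standard linear-response form chi = d<O>_eps/d eps at eps = 0, which reduces to a scaled covariance -n*beta*Cov(O, dL) between the observable O and the loss difference dL induced by the distribution shift, taken under the local posterior. The sketch below shows how such a susceptibility and its per-token attributions could be estimated from SGLD draws; the arrays, the constants n and beta, and the sign/scaling convention are all illustrative assumptions, not the paper's exact estimator.

    import numpy as np

    # Assumed setup: S SGLD draws from the local posterior at the trained weights,
    # T tokens drawn from the probe distribution (e.g. a GitHub-weighted Pile).
    rng = np.random.default_rng(0)
    S, T = 512, 1024
    n, beta = 10_000, 1.0  # dataset size and inverse temperature (illustrative)

    # obs[s]      : observable O(w_s) localized on a chosen component (e.g. one head)
    # dloss[s, t] : per-token loss difference under the shifted data distribution
    obs = rng.normal(size=S)             # placeholder SGLD observable values
    dloss = rng.normal(size=(S, T)) / T  # placeholder per-token loss differences

    # Linear-response estimate: susceptibility as (minus) a scaled covariance
    # between the observable and the perturbing loss, averaged over SGLD samples.
    centered_obs = obs - obs.mean()
    centered_dloss = dloss - dloss.mean(axis=0, keepdims=True)

    # Signed per-token contributions; summing over tokens gives the susceptibility.
    per_token = -n * beta * (centered_obs[:, None] * centered_dloss).mean(axis=0)
    chi = per_token.sum()

    print(f"susceptibility estimate: {chi:.4f}")
    print(f"highest-attribution tokens: {np.argsort(-np.abs(per_token))[:5]}")

Stacking such estimates over a collection of probes (rows) and network components (columns) would give the response matrix whose low-rank structure the abstract describes.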
View on arXiv: https://arxiv.org/abs/2504.18274

@article{baker2025_2504.18274,
  title={Studying Small Language Models with Susceptibilities},
  author={Garrett Baker and George Wang and Jesse Hoogland and Daniel Murfet},
  journal={arXiv preprint arXiv:2504.18274},
  year={2025}
}