All Papers

Transformers are widely used to extract complex semantic meanings from input tokens, yet they usually operate as black-box models. In this paper, we present a simple yet informative decomposition of the hidden states (or embeddings) of trained transformers into interpretable components. For any layer, the embedding vectors of input sequence samples are represented by a tensor. Given the embedding vector $\boldsymbol{h}_{c,t}$ at sequence position $t$ in a sequence (or context) $c$, extracting the mean effects yields the decomposition \[ \boldsymbol{h}_{c,t} = \boldsymbol{\mu} + \mathbf{pos}_t + \mathbf{ctx}_c + \mathbf{resid}_{c,t} \] where $\boldsymbol{\mu}$ is the global mean vector, $\mathbf{pos}_t$ and $\mathbf{ctx}_c$ are the mean vectors across contexts and across positions respectively, and $\mathbf{resid}_{c,t}$ is the residual vector. For popular transformer architectures and diverse text datasets, we empirically find pervasive mathematical structure: (1) $\mathbf{pos}_t$ forms a low-dimensional, continuous, and often spiral shape across layers; (2) $\mathbf{ctx}_c$ shows clear cluster structure that falls into context topics; and (3) $\mathbf{pos}_t$ and $\mathbf{ctx}_c$ are mutually incoherent -- namely, $\mathbf{pos}_t$ is almost orthogonal to $\mathbf{ctx}_c$ -- which is canonical in compressed sensing and dictionary learning. This decomposition offers structural insights about input formats in in-context learning (especially for induction heads) and in arithmetic tasks.
View on arXiv
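
As a concrete illustration of the mean-effects decomposition described in the abstract, below is a minimal NumPy sketch. It is not the paper's code: the array `H` of shape `(C, T, d)` (contexts, positions, embedding dimensions) and all variable names are illustrative stand-ins, and random data is used in place of real transformer hidden states.

```python
import numpy as np

# Minimal sketch of the mean-effects decomposition h_{c,t} = mu + pos_t + ctx_c + resid_{c,t}.
# H is a stand-in for one layer's hidden states, shape (C, T, d):
# C contexts, T sequence positions, d embedding dimensions (hypothetical sizes).
rng = np.random.default_rng(0)
C, T, d = 8, 16, 32
H = rng.normal(size=(C, T, d))

mu = H.mean(axis=(0, 1))        # global mean vector, shape (d,)
pos = H.mean(axis=0) - mu       # per-position mean across contexts, shape (T, d)
ctx = H.mean(axis=1) - mu       # per-context mean across positions, shape (C, d)
resid = H - mu - pos[None, :, :] - ctx[:, None, :]  # residual, shape (C, T, d)

# The decomposition is exact by construction of the residual term.
recon = mu + pos[None, :, :] + ctx[:, None, :] + resid
assert np.allclose(H, recon)

# Property (3) in the abstract: pos_t is almost orthogonal to ctx_c.
# Measured here as the largest absolute cosine similarity between the two sets.
pos_n = pos / np.linalg.norm(pos, axis=1, keepdims=True)
ctx_n = ctx / np.linalg.norm(ctx, axis=1, keepdims=True)
coherence = np.abs(pos_n @ ctx_n.T).max()
print(f"max |cos(pos_t, ctx_c)| = {coherence:.3f}")
```

On real hidden states, one would populate `H` from a trained model's layer outputs over a corpus of equal-length sequences; the random data here only demonstrates that the four terms recombine exactly, not the geometric findings (spiral positional structure, topic clusters, incoherence) reported in the paper.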