LLMs are Bayesian, In Expectation, Not in Realization
Exchangeability-based martingale diagnostics have been used to question Bayesian explanations of transformer in-context learning. We show that these violations are compatible with Bayesian/MDL behavior once we account for a basic architectural fact: positional encodings break exchangeability. Accordingly, the relevant baseline is performance in expectation over orderings of an exchangeable multiset, not performance under every fixed ordering.In a Bernoulli microscope (under explicit regularity assumptions), we bound the permutation-induced dispersion detected by martingale diagnostics (Theorem~3.4) while proving near-optimal expected MDL/compression over permutations (Theorem~3.6). Empirically, black-box next-token log-probabilities from an Azure OpenAI deployment exhibit nonzero expectation--realization gaps that decay with context length (mean 0.74 at to 0.26 at ; 95\% confidence intervals), and permutation averaging reduces order-induced standard deviation with a trend (Figure~2).Controlled from-scratch training ablations varying only the positional encoding show within-prefix order variance collapsing to with no positional encoding, but remaining -- under standard positional encoding schemes (Table~2). Robustness checks extend beyond Bernoulli to categorical sequences, synthetic in-context learning tasks, and evidence-grounded QA with permuted exchangeable evidence chunks.
View on arXiv