LLMs are Bayesian, In Expectation, Not in Realization

15 July 2025

Leon Chlon

Zein Khamis

Maggie Chlon

Mahdi El Zein

MarcAntonio M. Awada

Main:7 Pages

4 Figures

Bibliography:1 Pages

5 Tables

Appendix:4 Pages

Abstract

Exchangeability-based martingale diagnostics have been used to question Bayesian explanations of transformer in-context learning. We show that these violations are compatible with Bayesian/MDL behavior once we account for a basic architectural fact: positional encodings break exchangeability. Accordingly, the relevant baseline is performance in expectation over orderings of an exchangeable multiset, not performance under every fixed ordering. In a Bernoulli microscope (under explicit regularity assumptions), we bound the permutation-induced dispersion detected by martingale diagnostics (Theorem 3.4) while proving near-optimal expected MDL/compression over permutations (Theorem 3.6). Empirically, black-box next-token log-probabilities from an Azure OpenAI deployment exhibit nonzero expectation-realization gaps that decay with context length (mean 0.74 at $n = 10$ to 0.26 at $n = 50$; 95% confidence intervals), and permutation averaging reduces order-induced standard deviation with a $k^{-1/2}$ trend (Figure 2). Controlled from-scratch training ablations varying only the positional encoding show within-prefix order variance collapsing to $\approx 10^{-16}$ with no positional encoding, but remaining at $10^{-8}$ to $10^{-6}$ under standard positional encoding schemes (Table 2). Robustness checks extend beyond Bernoulli to categorical sequences, synthetic in-context learning tasks, and evidence-grounded QA with permuted exchangeable evidence chunks.
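The $k^{-1/2}$ trend for permutation averaging can be illustrated with a small simulation. The sketch below is not the paper's experiment: `toy_predictor` is a hypothetical, deliberately order-sensitive stand-in for a model with positional encodings (a recency-weighted, Laplace-smoothed Bernoulli estimate), and the decay rate, sequence length, and trial counts are arbitrary assumptions. It shows only that averaging predictions over $k$ independently sampled orderings of the same multiset shrinks the order-induced standard deviation roughly as $k^{-1/2}$.

```python
import random
import statistics


def toy_predictor(bits, decay=0.9):
    """Hypothetical order-sensitive estimate of P(next bit = 1).

    Later positions get larger weights, so the output depends on the
    ordering of `bits`, not just on the counts of 0s and 1s.
    """
    n = len(bits)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    numerator = 1.0 + sum(w * b for w, b in zip(weights, bits))
    denominator = 2.0 + sum(weights)
    return numerator / denominator


def permutation_averaged(bits, k, rng):
    """Average the toy prediction over k random orderings of the same multiset."""
    total = 0.0
    for _ in range(k):
        shuffled = list(bits)
        rng.shuffle(shuffled)
        total += toy_predictor(shuffled)
    return total / k


def order_std(bits, k, trials, rng):
    """Standard deviation of the k-permutation average across repeated trials."""
    samples = [permutation_averaged(bits, k, rng) for _ in range(trials)]
    return statistics.pstdev(samples)


rng = random.Random(0)
bits = [1] * 8 + [0] * 12  # one fixed exchangeable multiset (assumed sizes)
results = {k: order_std(bits, k, trials=2000, rng=rng) for k in (1, 4, 16)}

for k, sd in results.items():
    # sd * sqrt(k) should stay roughly constant if sd scales as k^(-1/2)
    print(f"k={k:2d}  std={sd:.5f}  std*sqrt(k)={sd * k ** 0.5:.5f}")
```

Because the orderings are sampled independently, the variance of the average is the single-ordering variance divided by $k$, which is the source of the $k^{-1/2}$ scaling in this toy setting; whether a real model follows the same trend is an empirical question of the kind the abstract reports.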
