What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization

International Conference on Artificial Intelligence and Statistics (AISTATS), 2023
Main: 14 pages
1 figure
Bibliography: 5 pages
Appendix: 29 pages
Abstract

In this paper, we conduct a comprehensive study of In-Context Learning (ICL) by addressing several open questions: (a) What type of ICL estimator is learned by language models? (b) What are suitable performance metrics for evaluating ICL, and what are the error rates? (c) How does the transformer architecture enable ICL? To answer (a), we take a Bayesian view and show that ICL implicitly implements the Bayesian model averaging algorithm, which we prove is approximately parameterized by the attention mechanism. For (b), we analyze ICL performance from an online learning perspective and establish a regret bound of $\mathcal{O}(1/T)$, where $T$ is the ICL input sequence length. To address (c), beyond the Bayesian model averaging algorithm encoded in attention, we show that during pretraining, the total variation distance between the learned model and the nominal model is bounded by the sum of an approximation error and a generalization error of $\tilde{\mathcal{O}}(1/\sqrt{N_{\mathrm{p}}T_{\mathrm{p}}})$, where $N_{\mathrm{p}}$ and $T_{\mathrm{p}}$ are, respectively, the number of token sequences and the length of each sequence in pretraining. Our results provide a unified understanding of the transformer and its ICL ability, with bounds on ICL regret, approximation, and generalization that deepen our knowledge of these essential aspects of modern language models.
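To make the Bayesian model averaging view concrete, here is a minimal, illustrative sketch (not the paper's actual construction): a toy family of candidate "tasks" (coin biases, a hypothetical stand-in for the latent concepts averaged over), where the predictive probability for the next token is the posterior-weighted average over candidates given the in-context examples.

```python
# Toy Bayesian model averaging over latent tasks (illustrative only;
# the candidate models and uniform prior here are assumptions, not
# the paper's setup).
models = {0.2: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}  # prior over bias theta

def bma_predict(observations, models):
    """Posterior-weighted probability that the next token is 1."""
    # Posterior weight of each candidate: prior times context likelihood.
    post = {}
    for theta, prior in models.items():
        lik = 1.0
        for x in observations:
            lik *= theta if x == 1 else (1 - theta)
        post[theta] = prior * lik
    z = sum(post.values())
    post = {t: w / z for t, w in post.items()}
    # Model-averaged predictive probability for the next token.
    return sum(theta * w for theta, w in post.items())

# As the in-context sequence grows, posterior mass concentrates on the
# candidate most consistent with the context.
print(round(bma_predict([1, 1, 1, 0, 1, 1], models), 3))  # → 0.741
```

In the paper's account, a prediction of this posterior-averaged form is what the attention mechanism approximately parameterizes; the toy example only shows the averaging itself.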
