Asymptotics of Language Model Alignment

Let $p$ denote a generative language model. Let $r$ denote a reward model that returns a scalar capturing the degree to which a draw from $p$ is preferred. The goal of language model alignment is to alter $p$ to a new distribution $\phi$ that results in a higher expected reward while keeping $\phi$ close to $p$. A popular alignment method is KL-constrained reinforcement learning (RL), which chooses a distribution $\phi_\Delta$ that maximizes the expected reward $\mathbb{E}_{y \sim \phi_\Delta}[r(y)]$ subject to a relative entropy constraint $\mathrm{KL}(\phi_\Delta \,\|\, p) \leq \Delta$. Another simple alignment method is best-of-$N$, where $N$ samples are drawn from $p$ and the one with the highest reward is selected. In this paper, we offer a closed-form characterization of the optimal KL-constrained RL solution. We demonstrate that any alignment method that achieves a comparable trade-off between KL divergence and reward must approximate the optimal KL-constrained RL solution in terms of relative entropy. To further analyze the properties of alignment methods, we introduce two simplifying assumptions: we let the language model be memoryless and the reward model be linear. Although these assumptions may not reflect complex real-world scenarios, they enable a precise characterization of the asymptotic behavior of both best-of-$N$ alignment and the KL-constrained RL method in terms of information-theoretic quantities. We prove that the reward of the optimal KL-constrained RL solution satisfies a large deviation principle, and we fully characterize its rate function. We also show that the rate of growth of the scaled cumulants of the reward is characterized by a proper Rényi cross entropy. Finally, we show that best-of-$N$ is asymptotically equivalent to the KL-constrained RL solution by proving that their expected rewards are asymptotically equal, and concluding that the two distributions must be close in KL divergence.
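
The abstract itself contains no code; the following is a minimal NumPy sketch, on a toy discrete alphabet, of the two alignment methods it compares: best-of-$N$ sampling and the exponentially tilted distribution $\phi(y) \propto p(y)\,e^{r(y)/\beta}$, which is the standard closed form for the KL-regularized objective and, for an appropriate temperature $\beta$, for the KL-constrained problem described above. The vocabulary size, reward values, temperature sweep, and sample counts are arbitrary illustration choices, not taken from the paper.

```python
import numpy as np

# Toy setup (hypothetical): a "language model" p over a small set of
# complete outputs y, and a scalar reward r(y) for each output.
rng = np.random.default_rng(0)
vocab_size = 6
p = rng.dirichlet(np.ones(vocab_size))   # reference model p(y)
r = rng.normal(size=vocab_size)          # reward r(y) per outcome

def best_of_n(p, r, n, rng):
    """Best-of-n: draw n samples from p and keep the one with highest reward."""
    samples = rng.choice(len(p), size=n, p=p)
    return samples[np.argmax(r[samples])]

def tilted_distribution(p, r, beta):
    """Exponentially tilted distribution phi(y) ∝ p(y) * exp(r(y) / beta).

    This solves max_phi E_phi[r] - beta * KL(phi || p); sweeping beta traces
    out the KL-constrained solutions for different constraint levels Delta.
    """
    logits = np.log(p) + r / beta
    logits -= logits.max()               # numerical stability
    phi = np.exp(logits)
    return phi / phi.sum()

def kl(q, p):
    """Relative entropy KL(q || p) in nats."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

# Compare the two methods on the toy problem.
for n in (1, 4, 16, 64):
    draws = np.array([best_of_n(p, r, n, rng) for _ in range(20000)])
    print(f"best-of-{n:<3d} expected reward ≈ {r[draws].mean():.3f}")

for beta in (2.0, 0.5, 0.1):
    phi = tilted_distribution(p, r, beta)
    print(f"beta={beta:<4} E_phi[r] = {phi @ r:.3f}, KL(phi || p) = {kl(phi, p):.3f}")
```

In this toy run, increasing $N$ (or lowering $\beta$) raises the expected reward while moving the aligned distribution further from $p$ in KL divergence, which is the trade-off the paper analyzes asymptotically.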
View on arXiv