Information Theoretic Guarantees For Policy Alignment In Large Language Models

Youssef Mroueh
Abstract

Policy alignment of large language models refers to constrained policy optimization, where the policy is optimized to maximize a reward while staying close to a reference policy with respect to an $f$-divergence such as the $\mathsf{KL}$ divergence. The best of $n$ alignment policy selects, among $n$ independent samples from the reference policy, the sample with the maximum reward. For both cases (policy alignment and best of $n$), recent works showed empirically that the reward improvement of the aligned policy over the reference one scales like $\sqrt{\mathsf{KL}}$, with an explicit bound in $n$ on the $\mathsf{KL}$ for the best of $n$ policy. We show in this paper that the $\sqrt{\mathsf{KL}}$ information theoretic upper bound holds if the reward under the reference policy has sub-gaussian tails. Moreover, we prove for the best of $n$ policy that the $\mathsf{KL}$ upper bound can be obtained for any $f$-divergence via a reduction to exponential order statistics, owing to the Rényi representation of order statistics and a data processing inequality. If additional information is known on the tails of the aligned policy, we show that tighter control on the reward improvement can be obtained via the Rényi divergence. Finally, we demonstrate how these upper bounds transfer from proxy rewards to golden rewards, which results in a decrease in the golden reward improvement due to overestimation and approximation errors of the proxy reward.
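
To make the two guarantees summarized above concrete, the following display gives a hedged sketch of the standard forms these bounds take; the notation ($\pi$ for the aligned policy, $\pi_{\mathrm{ref}}$ for the reference policy, $r$ for the reward, $\sigma$ for the sub-gaussian parameter, $\pi^{(n)}$ for the best of $n$ policy) is chosen here for illustration and is not necessarily the paper's own. First, if $r$ is $\sigma$-sub-gaussian under $\pi_{\mathrm{ref}}$, the reward improvement of any aligned policy is controlled by the square root of the $\mathsf{KL}$ budget:

$$\mathbb{E}_{\pi}[r] - \mathbb{E}_{\pi_{\mathrm{ref}}}[r] \;\le\; \sigma\,\sqrt{2\,\mathsf{KL}(\pi \,\|\, \pi_{\mathrm{ref}})}.$$

Second, for the best of $n$ policy $\pi^{(n)}$, which returns the highest-reward sample among $n$ i.i.d. draws from $\pi_{\mathrm{ref}}$, the commonly cited explicit bound in $n$ reads

$$\mathsf{KL}\big(\pi^{(n)} \,\|\, \pi_{\mathrm{ref}}\big) \;\le\; \log n - \frac{n-1}{n}.$$

Combining the two displays yields the empirically observed $\sqrt{\mathsf{KL}}$ (and hence roughly $\sqrt{\log n}$) scaling of the reward improvement for best of $n$.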
