
Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment

Main: 13 pages · Appendix: 38 pages · Bibliography: 6 pages · 11 figures · 5 tables
Abstract

Inference-time computation offers a powerful axis for scaling the performance of language models. However, naively increasing computation in techniques like Best-of-N sampling can lead to performance degradation due to reward hacking. Toward a theoretical understanding of how to best leverage additional computation, we focus on inference-time alignment, which we formalize as the problem of improving the quality of responses drawn from a pre-trained policy, given a prompt of interest and access to an imperfect reward model. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality and (ii) compute, and provide new results that highlight the importance of the pre-trained policy's coverage over high-quality responses for performance and compute scaling:

1. We show that Best-of-N alignment with an ideal choice of N can achieve optimal performance under stringent notions of coverage, but provably suffers from reward hacking when N is large, and fails to achieve tight guarantees under more realistic coverage conditions.

2. We introduce InferenceTimePessimism, a new algorithm that mitigates reward hacking through deliberate use of inference-time compute, implementing the principle of pessimism in the face of uncertainty via rejection sampling; we prove that its performance is optimal and does not degrade with N, meaning it is scaling-monotonic.

We complement our theoretical results with an experimental evaluation that demonstrates the benefits of InferenceTimePessimism across a variety of tasks and models.
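To make the Best-of-N baseline discussed above concrete, here is a minimal sketch. The `sample_response` policy and `reward_model` scorer are hypothetical stand-ins, not the paper's actual models: Best-of-N simply draws N candidate responses and keeps the one the (possibly imperfect) reward model scores highest.

```python
from itertools import cycle

def best_of_n(sample_response, reward_model, prompt, n):
    """Best-of-N sampling: draw n candidates from the policy and return
    the one with the highest score under the reward model."""
    candidates = [sample_response(prompt) for _ in range(n)]
    return max(candidates, key=lambda r: reward_model(prompt, r))

# Toy illustration: a hypothetical "policy" that cycles through canned
# responses, and a stand-in reward model that just prefers longer text.
toy_responses = cycle(["short answer", "a much more detailed answer", "off-topic reply"])
toy_policy = lambda prompt: next(toy_responses)
toy_reward = lambda prompt, response: len(response)
print(best_of_n(toy_policy, toy_reward, "explain X", n=5))
# → a much more detailed answer
```

Note that increasing `n` can only increase the reward-model score of the selected response; the paper's point is that when the reward model is imperfect, this monotone score improvement can coincide with degrading true quality (reward hacking).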
