
A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models

Main: 9 pages · Appendix: 21 pages · Bibliography: 5 pages · 24 figures · 2 tables
Abstract

We propose a general two-stage algorithm that enjoys a provable scaling law for the test-time compute of large language models (LLMs). Given an input problem, the proposed algorithm first generates $N$ candidate solutions, and then chooses the best one via a multiple-round knockout tournament where each pair of candidates is compared $K$ times and only the winners move on to the next round. In a minimalistic implementation, both stages can be executed with a black-box LLM alone and nothing else (e.g., no external verifier or reward model), and a total of $N \times (K + 1)$ highly parallelizable LLM calls are needed for solving an input problem. Assuming that a generated candidate solution is correct with probability $p_{\text{gen}} > 0$ and that a comparison between a pair of correct and incorrect solutions identifies the right winner with probability $p_{\text{comp}} > 0.5$ (i.e., better than a random guess), we prove theoretically that the failure probability of the proposed algorithm decays to zero exponentially in $N$ and $K$: $$\mathbb{P}(\text{final output is incorrect}) \le (1 - p_{\text{gen}})^N + \lceil \log_2 N \rceil \, e^{-2 K (p_{\text{comp}} - 0.5)^2}.$$ Our empirical results on the challenging MMLU-Pro benchmark validate the technical assumptions, as well as the efficacy of the proposed algorithm and the gains from scaling up its test-time compute.
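As a sanity check on the abstract's bound, the two-stage procedure can be simulated by replacing both LLM calls with Bernoulli draws: each generated candidate is correct with probability $p_{\text{gen}}$, and each pairwise comparison between a correct and an incorrect candidate picks the correct one with probability $p_{\text{comp}}$ (comparisons between two equally correct or equally incorrect candidates advance either one). This is a minimal Monte Carlo sketch under those stated assumptions, not the paper's actual implementation; all function names and parameter values below are illustrative.

```python
import math
import random


def knockout(candidates, p_comp, K, rng):
    """Run the multiple-round knockout tournament on a list of booleans
    (True = correct candidate); each mismatched pair is decided by a
    majority vote over K noisy comparisons."""
    while len(candidates) > 1:
        winners = []
        for i in range(0, len(candidates) - 1, 2):
            a, b = candidates[i], candidates[i + 1]
            if a == b:
                # Both correct or both incorrect: either may advance.
                winners.append(a)
            else:
                # Each comparison favors the correct candidate w.p. p_comp;
                # the correct one wins iff it takes a majority of K votes.
                votes_for_correct = sum(rng.random() < p_comp for _ in range(K))
                winners.append(votes_for_correct > K / 2)
        if len(candidates) % 2 == 1:
            winners.append(candidates[-1])  # odd count: last candidate gets a bye
        candidates = winners
    return candidates[0]


def failure_rate(N, K, p_gen, p_comp, trials=20000, seed=0):
    """Estimate P(final output is incorrect) over many simulated runs."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        cands = [rng.random() < p_gen for _ in range(N)]
        if not knockout(cands, p_comp, K, rng):
            failures += 1
    return failures / trials


# Illustrative parameters (not from the paper): N = 8, K = 25 (odd, no ties).
N, K, p_gen, p_comp = 8, 25, 0.2, 0.8
est = failure_rate(N, K, p_gen, p_comp)
bound = (1 - p_gen) ** N + math.ceil(math.log2(N)) * math.exp(
    -2 * K * (p_comp - 0.5) ** 2
)
print(f"empirical failure rate: {est:.4f}  theoretical bound: {bound:.4f}")
```

With these parameters the first term $(1 - p_{\text{gen}})^N \approx 0.168$ dominates the bound ($\approx 0.20$), and the simulated failure rate stays below it, consistent with the claimed exponential decay in both $N$ and $K$.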
