
RoBoN: Routed Online Best-of-n for Test-Time Scaling with Multiple LLMs

Main: 6 Pages
3 Figures
Bibliography: 4 Pages
6 Tables
Appendix: 10 Pages
Abstract

Best-of-$n$ is a widely used test-time scaling approach for LLM inference. Yet despite evidence that LLMs exhibit complementary strengths across tasks, best-of-$n$ traditionally relies on a single model to generate responses. We propose RoBoN (Routed Online Best-of-$n$), a sequential multi-LLM alternative to the prevailing single-model best-of-$n$. Given a suite of models $\{m_i\}_{i=1}^M$, RoBoN routes generations across models sequentially, one at a time, based on scores computed from a reward model and an agreement signal on the predicted responses. This online routing requires no additional training, maintains compute parity, and works with any plug-in reward model. Across reasoning benchmarks (MATH500, OlympiadBench, MinervaMath, GSM8K, MMLU), RoBoN consistently outperforms standard best-of-$n$ applied to each individual model for larger $n$, with gains of up to 3.4% in absolute accuracy, and also improves over a uniform multi-model portfolio baseline. Our results indicate that diversity across models can be exploited at inference time to improve best-of-$n$ performance over any constituent model alone, providing a simple, training-free path to test-time scaling with multiple LLMs.
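To make the contrast concrete, the sketch below places standard single-model best-of-$n$ next to one possible sequential routing loop. It is an illustration only: the abstract does not specify RoBoN's routing rule, so the per-model score used here (mean reward plus a weighted agreement fraction), the one-sample-per-model warm-up, exact string match as the agreement signal, and the names `routed_best_of_n` and `agreement_weight` are all assumptions rather than the authors' method. What it does take from the abstract is the overall shape: $n$ generations in total (compute parity), issued one at a time, with a plug-in reward model and no training.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Generator = Callable[[str], str]       # prompt -> one sampled response
Reward = Callable[[str, str], float]   # (prompt, response) -> scalar score


def best_of_n(prompt: str, model: Generator, reward: Reward, n: int) -> str:
    """Standard single-model best-of-n: sample n responses, keep the top-scored one."""
    responses = [model(prompt) for _ in range(n)]
    return max(responses, key=lambda r: reward(prompt, r))


def _model_score(i: int, samples: List[Tuple[int, str, float]],
                 agreement_weight: float) -> float:
    """Hypothetical routing score: mean reward of model i's samples plus the
    average fraction of all samples so far that match each of its responses."""
    counts = Counter(resp for _, resp, _ in samples)
    own = [(resp, r) for j, resp, r in samples if j == i]
    mean_reward = sum(r for _, r in own) / len(own)
    mean_agree = sum(counts[resp] for resp, _ in own) / (len(own) * len(samples))
    return mean_reward + agreement_weight * mean_agree


def routed_best_of_n(prompt: str, models: Sequence[Generator], reward: Reward,
                     n: int, agreement_weight: float = 1.0) -> str:
    """Assumed sequential routing: n generations total, chosen one at a time."""
    samples: List[Tuple[int, str, float]] = []  # (model index, response, reward)
    for step in range(n):
        if step < len(models):
            idx = step  # warm-up (assumption): try each model once
        else:
            idx = max(range(len(models)),
                      key=lambda i: _model_score(i, samples, agreement_weight))
        resp = models[idx](prompt)
        samples.append((idx, resp, reward(prompt, resp)))
    # Final selection is ordinary best-of-n over the pooled samples.
    return max(samples, key=lambda s: s[2])[1]
```

With `models` given as a list of sampling functions and `reward` as any scorer, `routed_best_of_n(prompt, models, reward, n)` issues exactly `n` generation calls, the same budget as `best_of_n` on a single model. A uniform multi-model portfolio would instead split those `n` calls evenly across the models before pooling.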
