ResearchTrend.AI
CarBoN: Calibrated Best-of-N Sampling Improves Test-time Reasoning

17 October 2025
Yung-Chen Tang
Pin-Yu Chen
Andrea Cavallaro
Main: 9 pages · Appendix: 10 pages · Bibliography: 2 pages · 6 figures · 12 tables
Abstract

Allocating more computation at inference time (test-time scaling) improves language model performance, especially on reasoning tasks. However, popular methods like Best-of-N sampling often show diminishing returns as N increases. To address this inefficiency, we introduce a general test-time calibration framework that adaptively steers the model toward high-reward reasoning paths, with theoretical guarantees of improving the lower bound of the expected reward under finite sampling, all without retraining the large language model (LLM). Within this framework, we propose CarBoN (Calibrated Best-of-N), a two-phase method that first explores the solution space and then learns a calibration of the logits via an input-specific temperature T and additive shift vector δ, guiding generation toward more reliable reasoning. Experiments on MATH-500 and AIME-2024 show that CarBoN improves efficiency, requiring up to 4× fewer rollouts to reach the same accuracy, while often achieving higher accuracy under fixed budgets. We also analyze the complementary roles of T and δ in balancing output diversity and correctness, and demonstrate that the framework also generalizes to step-level sampling strategies such as beam search. For more information, please refer to our project page at this http URL.
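The core mechanism described in the abstract — calibrating logits with a temperature T and additive shift vector δ before sampling, then selecting the best of N candidates by reward — can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy `best_of_n` driver, and the use of raw logit vectors rather than a full LLM are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def calibrated_sample(logits, T=1.0, delta=None):
    """Sample one index from softmax((logits + delta) / T).

    T  scales the distribution's sharpness (lower T = more greedy);
    delta is an additive shift vector biasing particular tokens.
    Both are assumed here to be given; CarBoN learns them per input.
    """
    if delta is None:
        delta = np.zeros_like(logits)
    z = (logits + delta) / T
    z = z - z.max()                      # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(p), p=p)

def best_of_n(generate, reward, n):
    """Best-of-N: draw n candidate solutions, keep the highest-reward one."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=reward)
```

With a very low temperature the calibrated distribution concentrates on the argmax of `logits + delta`, which is how the shift vector can redirect generation toward higher-reward paths without touching model weights.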
