
Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding

Abstract

Test-time scaling improves large language model performance by adding extra compute during decoding. Best-of-N (BoN) sampling serves as a common scaling technique, broadening the search space for finding better solutions from the model distribution. However, traditional BoN requires N full generations, leading to high GPU memory overhead and latency. Moreover, some methods depend on reward models, adding computational cost and limiting domain generalization. In this paper, we propose Self-Truncation Best-of-N (ST-BoN), a novel decoding method that avoids fully generating all samplings and eliminates the need for reward models. ST-BoN introduces early sampling consistency to estimate the most promising sample, truncating suboptimal ones to free memory and accelerate inference. This advances sampling-efficient test-time scaling. Compared to traditional BoN, ST-BoN can reduce dynamic GPU memory overhead by over 90% and latency by 50%, while achieving comparable or even better performance across reasoning and open-ended domains.
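To make the idea concrete, here is a minimal sketch of the self-truncation loop described above. This is a hypothetical simplification, not the authors' actual algorithm: it stands in a simple token-overlap measure (`jaccard`) for whatever consistency signal a real implementation would compute over early decoding states, and it operates on plain strings rather than model samples.

```python
# Illustrative sketch of the ST-BoN idea (hypothetical simplification):
# decode short prefixes for N samples, estimate which sample is most
# consistent with the others, and continue only that one.

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two prefixes (a stand-in for the
    consistency measure a real implementation might use)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def self_truncate_best_of_n(prefixes: list[str]) -> int:
    """Score each early prefix by its mean similarity to all other prefixes
    and return the index of the most consistent one; in a full decoder the
    remaining samples would be truncated to free memory."""
    n = len(prefixes)
    scores = [
        sum(jaccard(prefixes[i], prefixes[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    return max(range(n), key=scores.__getitem__)

# Toy example: three early decodes of the same prompt; two agree with each other.
prefixes = [
    "the answer is 42 because",
    "the answer is 42 since",
    "it could be 7 maybe",
]
best = self_truncate_best_of_n(prefixes)
print(best)  # index of the majority-consistent prefix
```

The key design point the sketch mirrors is that the winner is chosen from short early prefixes, so only one sample ever needs to be decoded to full length.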

@article{wang2025_2503.01422,
  title={Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding},
  author={Yiming Wang and Pei Zhang and Siyuan Huang and Baosong Yang and Zhuosheng Zhang and Fei Huang and Rui Wang},
  journal={arXiv preprint arXiv:2503.01422},
  year={2025}
}