Soft Best-of-n Sampling for Model Alignment

Best-of-n (BoN) sampling is a practical approach for aligning language model outputs with human preferences without expensive fine-tuning. BoN sampling is performed by generating n responses to a prompt and then selecting the sample that maximizes a reward function. BoN yields high reward values in practice at a distortion cost, as measured by the KL-divergence between the sampled and original distributions. This distortion is coarsely controlled by varying the number of samples: larger n yields a higher reward at a higher distortion cost. We introduce Soft Best-of-n sampling, a generalization of BoN that allows for smooth interpolation between the original distribution and the reward-maximizing distribution through a temperature parameter λ. We establish theoretical guarantees showing that Soft Best-of-n sampling converges sharply to the optimal tilted distribution at a rate of O(1/n) in KL divergence and in expected (relative) reward. For sequences of discrete outputs, we analyze an additive reward model that reveals the fundamental limitations of blockwise sampling.
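The abstract describes the procedure only at a high level; the sketch below illustrates one plausible reading of Soft Best-of-n sampling, assuming the soft selection step is a softmax over the rewards of the n candidates with temperature λ, so that small λ approaches standard BoN (argmax selection) and large λ approaches the base model distribution. The helper names sample_fn and reward_fn are placeholders for illustration, not functions from the paper.

```python
import math
import random

def soft_best_of_n(sample_fn, reward_fn, n, lam):
    """Hypothetical sketch of Soft Best-of-n sampling.

    sample_fn() draws one response from the base language model and
    reward_fn(y) scores a response. Selection weights are assumed to be
    proportional to exp(reward / lam): as lam -> 0 this reduces to
    standard Best-of-n (pick the highest-reward candidate); as
    lam -> infinity it returns a uniform draw from the n candidates,
    i.e. a sample from the base model.
    """
    candidates = [sample_fn() for _ in range(n)]
    rewards = [reward_fn(y) for y in candidates]

    # Softmax over the n candidates; subtract the max reward for
    # numerical stability before exponentiating.
    m = max(rewards)
    weights = [math.exp((r - m) / lam) for r in rewards]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Sample one candidate according to the softened reward weights.
    return random.choices(candidates, weights=probs, k=1)[0]
```

In this reading, BoN is recovered as a limiting case rather than a separate algorithm, and λ gives the fine-grained control over the reward/KL-distortion trade-off that varying n alone provides only coarsely.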
@article{verdun2025_2505.03156,
  title={Soft Best-of-n Sampling for Model Alignment},
  author={Claudio Mayrink Verdun and Alex Oesterling and Himabindu Lakkaraju and Flavio P. Calmon},
  journal={arXiv preprint arXiv:2505.03156},
  year={2025}
}