
Foundations of Top-$k$ Decoding For Language Models

Main: 10 pages, 9 figures, 4 tables; Appendix: 28 pages
Abstract

Top-$k$ decoding is a widely used method for sampling from LLMs: at each token, only the largest $k$ next-token probabilities are kept, and the next token is sampled after re-normalizing them to sum to unity. Top-$k$ and other sampling methods are motivated by the intuition that true next-token distributions are sparse, and the noisy LLM probabilities need to be truncated. However, to our knowledge, a precise theoretical motivation for the use of top-$k$ decoding is missing. In this work, we develop a theoretical framework that both explains and generalizes top-$k$ decoding. We view decoding at a fixed token as the recovery of a sparse probability distribution. We consider \emph{Bregman decoders} obtained by minimizing a separable Bregman divergence (for both the \emph{primal} and \emph{dual} cases) with a sparsity-inducing $\ell_0$ regularization. Despite the combinatorial nature of the objective, we show how to optimize it efficiently for a large class of divergences. We show that the optimal decoding strategies are greedy, and further that the loss function is discretely convex in $k$, so that binary search provably and efficiently finds the optimal $k$. We show that top-$k$ decoding arises as a special case for the KL divergence, and identify new decoding strategies that have distinct behaviors (e.g., non-linearly up-weighting larger probabilities after re-normalization).
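To make the baseline concrete, here is a minimal sketch of standard top-$k$ sampling as described in the first sentence of the abstract: keep the $k$ largest probabilities, re-normalize, and sample. This is an illustrative NumPy implementation; the function name and interface are not from the paper.

```python
import numpy as np

def top_k_sample(probs: np.ndarray, k: int, rng: np.random.Generator) -> int:
    """Sample a token index with standard top-k decoding.

    Keeps only the k largest next-token probabilities, re-normalizes
    them to sum to one, and samples from the truncated distribution.
    """
    # Indices of the k largest probabilities (order among them is irrelevant).
    top_idx = np.argpartition(probs, -k)[-k:]
    top_probs = probs[top_idx]
    top_probs = top_probs / top_probs.sum()  # re-normalize to unity
    return int(rng.choice(top_idx, p=top_probs))

# Usage: a toy next-token distribution over a 6-token vocabulary.
rng = np.random.default_rng(0)
p = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
token = top_k_sample(p, k=3, rng=rng)  # samples only from the 3 most likely tokens
```

The paper's Bregman decoders generalize this re-normalization step; the linear re-scaling above corresponds to the KL-divergence special case.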

@article{noarov2025_2505.19371,
  title={Foundations of Top-$k$ Decoding For Language Models},
  author={Georgy Noarov and Soham Mallick and Tao Wang and Sunay Joshi and Yan Sun and Yangxinyu Xie and Mengxin Yu and Edgar Dobriban},
  journal={arXiv preprint arXiv:2505.19371},
  year={2025}
}