Foundations of Top-$k$ Decoding For Language Models

Top-$k$ decoding is a widely used method for sampling from LLMs: at each token, only the $k$ largest next-token probabilities are kept, and the next token is sampled after re-normalizing them to sum to unity. Top-$k$ and other sampling methods are motivated by the intuition that true next-token distributions are sparse, and that the noisy LLM probabilities need to be truncated. However, to our knowledge, a precise theoretical motivation for the use of top-$k$ decoding is missing. In this work, we develop a theoretical framework that both explains and generalizes top-$k$ decoding. We view decoding at a fixed token as the recovery of a sparse probability distribution. We consider \emph{Bregman decoders} obtained by minimizing a separable Bregman divergence (for both the \emph{primal} and \emph{dual} cases) with a sparsity-inducing regularization. Despite the combinatorial nature of the objective, we show how to optimize it efficiently for a large class of divergences. We show that the optimal decoding strategies are greedy, and further that the loss function is discretely convex in $k$, so that binary search provably and efficiently finds the optimal $k$. We show that top-$k$ decoding arises as a special case for the KL divergence, and identify new decoding strategies that have distinct behaviors (e.g., non-linearly up-weighting larger probabilities after re-normalization).
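To make the sampling procedure concrete, here is a minimal NumPy sketch of a single top-$k$ decoding step as described above; the function name and example distribution are illustrative and not taken from the paper.

```python
import numpy as np

def top_k_sample(probs: np.ndarray, k: int, rng: np.random.Generator) -> int:
    """Sample a token id via top-k decoding.

    Keeps only the k largest next-token probabilities, re-normalizes
    them to sum to unity, and samples from the truncated distribution.
    """
    # Indices of the k largest probabilities (order within the top-k does not matter).
    top_idx = np.argpartition(probs, -k)[-k:]
    # Re-normalize the retained probability mass to sum to one.
    top_probs = probs[top_idx]
    top_probs = top_probs / top_probs.sum()
    # Sample one token from the truncated, re-normalized distribution.
    return int(rng.choice(top_idx, p=top_probs))

# Example: a (hypothetical) next-token distribution over a 6-token vocabulary.
rng = np.random.default_rng(0)
p = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
token = top_k_sample(p, k=3, rng=rng)  # samples from tokens {0, 1, 2} with probs 0.5, 0.3125, 0.1875
```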
@article{noarov2025_2505.19371,
  title   = {Foundations of Top-$k$ Decoding For Language Models},
  author  = {Georgy Noarov and Soham Mallick and Tao Wang and Sunay Joshi and Yan Sun and Yangxinyu Xie and Mengxin Yu and Edgar Dobriban},
  journal = {arXiv preprint arXiv:2505.19371},
  year    = {2025}
}