A Formal Perspective on Byte-Pair Encoding

Abstract

Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{\sigma(\boldsymbol{\mu}^\star)}\left(1-e^{-\sigma(\boldsymbol{\mu}^\star)}\right)$-approximation of an optimal merge sequence, where $\sigma(\boldsymbol{\mu}^\star)$ is the total backward curvature with respect to the optimal merge sequence $\boldsymbol{\mu}^\star$. Empirically, the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}(NM)$ to $\mathcal{O}(N \log M)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.
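For orientation, the iterative greedy procedure referred to above can be sketched as follows. This is a minimal illustration of the textbook greedy loop, not the paper's implementation: the function names `merge_pair` and `greedy_bpe`, the character-level starting alphabet, the counting of overlapping pairs, and the first-seen tie-breaking rule are all assumptions made for the example. Each iteration recounts pairs and rescans the sequence, which corresponds to the naive $\mathcal{O}(NM)$ cost mentioned in the abstract, not the faster variant.

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])  # fuse the two symbols into one
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def greedy_bpe(text, num_merges):
    """Naive greedy BPE: repeatedly merge the most frequent adjacent pair.

    Assumptions (illustrative only): symbols start as single characters,
    overlapping pairs are counted, and ties go to the pair seen first.
    """
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        counts = Counter(zip(seq, seq[1:]))  # adjacent-pair frequencies
        if not counts:
            break  # fewer than two symbols left, nothing to merge
        # max() returns the first maximal item, so ties go to the earliest pair
        pair, _count = max(counts.items(), key=lambda kv: kv[1])
        merges.append(pair)
        seq = merge_pair(seq, pair)
    return merges, seq


merges, seq = greedy_bpe("aaabdaaabac", 3)
print(merges)  # the learned merge sequence
print(seq)     # the compressed symbol sequence
```

The number of symbols removed, `len(text) - len(seq)`, is one natural way to measure how much a merge sequence compresses the input; treating it as the objective here is an assumption of this sketch.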
