Tokenisation is NP-Complete
Main: 7 pages · Bibliography: 3 pages · Appendix: 11 pages · 4 tables
Abstract
In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most a given number of symbols by either finding a vocabulary directly (direct tokenisation) or selecting a sequence of merge operations (bottom-up tokenisation).
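As a rough illustration of the bottom-up variant (a sketch under my own assumptions, not the paper's formalisation): a candidate solution is an ordered list of BPE-style merge operations, and the objective is the number of symbols left in the dataset after applying them. The function name `apply_merges` and the toy dataset below are hypothetical.

```python
def apply_merges(seq, merges):
    """Apply each merge (a, b) -> a+b left-to-right, in the given order,
    to a list of symbols; returns the compressed symbol sequence."""
    for a, b in merges:
        out = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)  # replace the adjacent pair with one merged symbol
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq

dataset = list("aabab")          # toy dataset of 5 symbols
merges = [("a", "b")]            # one merge operation: 'a' 'b' -> 'ab'
compressed = apply_merges(dataset, merges)
# compressed == ["a", "ab", "ab"]: 3 symbols remain
```

The decision problem asks whether some merge sequence brings the symbol count at or below the given budget; the hardness result says that choosing such a sequence optimally is NP-complete.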
