Tokenisation is NP-Complete
Main: 7 pages · Bibliography: 3 pages · Appendix: 11 pages · 4 tables
Abstract
In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most a given number of symbols by either finding a vocabulary directly (direct tokenisation) or selecting a sequence of merge operations (bottom-up tokenisation).
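As a rough illustration of the bottom-up variant (a sketch under my own assumptions, not the paper's formalisation): a candidate solution is an ordered list of BPE-style merge operations, and the objective is the number of symbols left in the dataset after applying them. The function name `apply_merges` and the toy dataset below are hypothetical.

```python
def apply_merges(seq, merges):
    """Apply each merge (a, b) -> a+b left-to-right, in the given order,
    to a list of symbols; returns the compressed symbol sequence."""
    for a, b in merges:
        out = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)  # replace the adjacent pair with one merged symbol
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq

dataset = list("aabab")          # toy dataset of 5 symbols
merges = [("a", "b")]            # one merge operation: 'a' 'b' -> 'ab'
compressed = apply_merges(dataset, merges)
# compressed == ["a", "ab", "ab"]: 3 symbols remain
```

The decision problem asks whether some merge sequence brings the symbol count at or below the given budget; the hardness result says that choosing such a sequence optimally is NP-complete.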
