Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment

11 August 2025

Saketh Reddy Vemula

Sandipan Dandapat

D. Sharma


Main: 9 pages, Bibliography: 5 pages, Appendix: 7 pages, 7 figures, 15 tables

Abstract

Prior work on language modeling has reported conflicting findings on whether morphologically aligned tokenization improves performance, particularly for languages with complex morphology. To investigate this, we select a typologically diverse set of languages: Telugu (agglutinative), Hindi (primarily fusional with some agglutination), and English (fusional). We conduct a comprehensive evaluation of language models, from tokenizer training through finetuning and downstream task evaluation. To account for the consistent performance differences observed across tokenizer variants, we focus on two key factors: morphological alignment and tokenization quality. To assess the morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms.
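
The abstract does not state how morphological alignment is scored. As an illustration only, one common way to compare a tokenizer's output against gold morpheme segmentations is boundary-level precision, recall, and F1. The sketch below assumes that choice; the function names `boundaries` and `boundary_prf`, the English example word, and the convention of scoring an empty boundary set as 1.0 are all hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple


def boundaries(segments: List[str]) -> Set[int]:
    """Return the internal split positions (character offsets) of a segmentation."""
    cuts: Set[int] = set()
    pos = 0
    for seg in segments[:-1]:  # the end of the last segment is not a boundary
        pos += len(seg)
        cuts.add(pos)
    return cuts


def boundary_prf(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Boundary precision, recall, and F1 of a predicted segmentation against gold.

    Assumed convention (not from the paper): an empty boundary set scores 1.0
    for the corresponding precision or recall.
    """
    if "".join(predicted) != "".join(gold):
        raise ValueError("predicted and gold must segment the same word")
    pred_cuts = boundaries(predicted)
    gold_cuts = boundaries(gold)
    true_pos = len(pred_cuts & gold_cuts)
    precision = true_pos / len(pred_cuts) if pred_cuts else 1.0
    recall = true_pos / len(gold_cuts) if gold_cuts else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative word only; the paper's dataset contains Telugu word forms.
gold = ["un", "happi", "ness"]
predicted = ["un", "happiness"]  # hypothetical tokenizer output
p, r, f = boundary_prf(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Averaging these scores over every word in a gold dataset would give one alignment number per tokenizer, which could then be set against downstream task performance.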
