
Two Counterexamples to Tokenization and the Noiseless Channel

22 February 2024

Papers citing "Two Counterexamples to Tokenization and the Noiseless Channel"


Title | Authors | Topic tag | Counts | Date
Length-MAX Tokenizer for Language Models | Dong Dong, Weijie Su | VLM | 118 0 0 | 25 Nov 2025
UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8 | Preston Firestone, Shubham Ugare, Gagandeep Singh, Sasa Misailovic | - | 80 1 0 | 05 Nov 2025
Aneurysm Growth Time Series Reconstruction Using Physics-informed Autoencoder | Jiacheng Wu | AI4CE | 76 10 0 | 05 Oct 2025
Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment | Saketh Reddy Vemula, Sandipan Dandapat, D. Sharma, Parameswari Krishnamurthy | - | 199 0 0 | 11 Aug 2025