2409.04599
Cited By
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
6 September 2024
Pavel Chizhov
Catherine Arnett
Elizaveta Korotkova
Ivan P. Yamshchikov
Papers citing "BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training"
7 / 7 papers shown
Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models
Taido Purason
Pavel Chizhov
Ivan P. Yamshchikov
Mark Fishel
CLL
VLM
132
0
0
03 Dec 2025
IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs
Souvik Rana
Arul Menezes
Ashish Kulkarni
Chandra Khatri
Shubham Agarwal
112
0
0
05 Nov 2025
Aneurysm Growth Time Series Reconstruction Using Physics-informed Autoencoder
Jiacheng Wu
AI4CE
92
12
0
05 Oct 2025
Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training
Woojin Chung
Jeonghoon Kim
188
1
0
21 Aug 2025
Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
Saketh Reddy Vemula
Sandipan Dandapat
D. Sharma
Parameswari Krishnamurthy
231
0
0
11 Aug 2025
Incorporating Domain Knowledge into Materials Tokenization
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yerim Oh
Jun-Hyung Park
Junho Kim
SungHo Kim
S. Lee
160
0
0
09 Jun 2025
Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization
Dixuan Wang
Yanda Li
Junyuan Jiang
Zepeng Ding
Ziqin Luo
Guochao Jiang
Jiaqing Liang
Deqing Yang
480
30
0
27 May 2024