BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
6 September 2024
Pavel Chizhov
Catherine Arnett
Elizaveta Korotkova
Ivan P. Yamshchikov

Papers citing "BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training"

7 / 7 papers shown
Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models
Taido Purason
Pavel Chizhov
Ivan P. Yamshchikov
Mark Fishel
CLL, VLM
03 Dec 2025
IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs
Souvik Rana
Arul Menezes
Ashish Kulkarni
Chandra Khatri
Shubham Agarwal
05 Nov 2025
Aneurysm Growth Time Series Reconstruction Using Physics-informed Autoencoder
Jiacheng Wu
AI4CE
05 Oct 2025
Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training
Woojin Chung
Jeonghoon Kim
21 Aug 2025
Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
Saketh Reddy Vemula
Sandipan Dandapat
D. Sharma
Parameswari Krishnamurthy
11 Aug 2025
Incorporating Domain Knowledge into Materials Tokenization
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yerim Oh
Jun-Hyung Park
Junho Kim
SungHo Kim
S. Lee
09 Jun 2025
Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization
Dixuan Wang
Yanda Li
Junyuan Jiang
Zepeng Ding
Ziqin Luo
Guochao Jiang
Jiaqing Liang
Deqing Yang
27 May 2024