Tokenizer Choice For LLM Training: Negligible or Crucial?

12 October 2023
Mehdi Ali
Michael Fromm
Klaudia Thellmann
Richard Rutmann
Max Lübbering
Johannes Leveling
Katrin Klug
Jan Ebert
Niclas Doll
Jasper Schulze Buschhoff
Charvi Jain
Alexander Arno Weber
Lena Jurkschat
Hammam Abdelwahab
Chelsea John
Pedro Ortiz Suarez
Malte Ostendorff
Samuel Weinbach
R. Sifa
Stefan Kesselheim
Nicolas Flores-Herr

Papers citing "Tokenizer Choice For LLM Training: Negligible or Crucial?"

46 / 46 papers shown
On the Origin of Algorithmic Progress in AI
Hans Gundlach
Alex Fogelson
Jayson Lynch
Ana Trisovic
Jonathan Rosenfeld
Anmol Sandhu
Neil Thompson
85
0
0
26 Nov 2025
Length-MAX Tokenizer for Language Models
Dong Dong
Weijie Su
VLM
191
0
0
25 Nov 2025
Tokenisation over Bounded Alphabets is Hard
Violeta Kastreva
Philip Whittington
Dennis Komm
Tiago Pimentel
144
0
0
19 Nov 2025
Enhancing LLM Code Generation Capabilities through Test-Driven Development and Code Interpreter
Sajed Jalil
Shuvo Saha
Hossain Mohammad Seym
76
0
0
16 Nov 2025
UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8
Preston Firestone
Shubham Ugare
Gagandeep Singh
Sasa Misailovic
96
1
0
05 Nov 2025
How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis
Ahmed Mostafa
Raisul Arefin Nahid
Samuel Mulder
100
2
0
05 Nov 2025
IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs
Souvik Rana
Arul Menezes
Ashish Kulkarni
Chandra Khatri
Shubham Agarwal
113
0
0
05 Nov 2025
Model-Aware Tokenizer Transfer
Mykola Haltiuk
Aleksander Smywiński-Pohl
113
0
0
24 Oct 2025
See the Text: From Tokenization to Visual Reading
Ling Xing
Alex Jinpeng Wang
Rui Yan
Hongyu Qu
Zechao Li
Jinhui Tang
VLM
156
0
0
21 Oct 2025
so much depends / upon / a whitespace: Why Whitespace Matters for Poets and LLMs
Sriharsh Bhyravajjula
Melanie Walsh
Anna Preus
Maria Antoniak
113
0
0
19 Oct 2025
Tahakom LLM Guidelines and Recipes: From Pre-training Data to an Arabic LLM
Areej AlOtaibi
Lina Alyahya
Raghad Alshabanah
Shahad Alfawzan
Shuruq Alarefei
...
Waad Alahmed
Omar Talabay
Jalal Alowibdi
Salem Alelyani
Adel Bibi
193
0
0
15 Oct 2025
Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation
Mir Tafseer Nayeem
Sawsan Alqahtani
Md Tahmid Rahman Laskar
Tasnim Mohiuddin
M Saiful Bari
132
0
0
11 Oct 2025
Aneurysm Growth Time Series Reconstruction Using Physics-informed Autoencoder
Jiacheng Wu
AI4CE
92
12
0
05 Oct 2025
Comparative Analysis of Tokenization Algorithms for Low-Resource Language Dzongkha
Tandin Wangchuk
Tad Gonsalves
84
0
0
18 Sep 2025
Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian
Michael Hoffmann
Jophin John
Stefan Schweter
Gokul Ramakrishnan
Hoi-Fong Mak
Alice Zhang
Dmitry Gaynullin
Nicolay J. Hammer
CLL
160
1
0
06 Sep 2025
SEA-BED: Southeast Asia Embedding Benchmark
Wuttikorn Ponwitayarat
Raymond Ng
Jann Railey Montalan
Thura Aung
Jian Gang Ngui
...
Panuthep Tasawong
Erik Cambria
Ekapol Chuangsuwanich
Sarana Nutanong
Peerat Limkonchotiwat
162
1
0
17 Aug 2025
UNVEILING: What Makes Linguistics Olympiad Puzzles Tricky for LLMs?
Mukund Choudhary
KV Aditya Srivatsa
Gaurja Aeron
Antara Raaghavi Bhattacharya
Dang Khoa Dang Dinh
Ikhlasul Akmal Hanif
Daria Kotova
Ekaterina Kochmar
Monojit Choudhury
LRM
1.5K
0
0
15 Aug 2025
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization
Negar Foroutan
Clara Meister
Debjit Paul
Joel Niklaus
Sina Ahmadi
Antoine Bosselut
Rico Sennrich
208
3
0
06 Aug 2025
On LLM-Assisted Generation of Smart Contracts from Business Processes
Fabian Stiehle
Hans Weytjens
Ingo Weber
177
0
0
30 Jul 2025
Pre-trained Models Perform the Best When Token Distributions Follow Zipf's Law
Yanjin He
Qingkai Zeng
Meng Jiang
172
1
0
30 Jul 2025
FLEXITOKENS: Flexible Tokenization for Evolving Language Models
A. Owodunni
Orevaoghene Ahia
Sachin Kumar
214
0
0
17 Jul 2025
Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models
Gyeongje Cho
Yeonkyoun So
Chanwoo Park
Sangmin Lee
Sungmok Jung
Jaejin Lee
VLM
216
0
0
18 Jun 2025
From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
Mathurin Videau
Badr Youbi Idrissi
Alessandro Leite
Marc Schoenauer
O. Teytaud
David Lopez-Paz
186
4
0
17 Jun 2025
Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning
Yang Zhang
Amr Mohamed
Hadi Abdine
Guokan Shang
Michalis Vazirgiannis
208
5
0
12 Jun 2025
One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers
Diana Abagyan
Alejandro Salamanca
Andres Felipe Cruz-Salinas
Kris Cao
Hangyu Lin
Acyr Locatelli
Marzieh Fadaee
Ahmet Üstün
Sara Hooker
CLL
369
3
0
12 Jun 2025
Causal Estimation of Tokenisation Bias
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Pietro Lesci
Clara Meister
Thomas Hofmann
Andreas Vlachos
Tiago Pimentel
240
6
0
03 Jun 2025
Token Distillation: Attention-aware Input Embeddings For New Tokens
Konstantin Dobler
Desmond Elliott
Gerard de Melo
VLM
411
1
0
26 May 2025
Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Kidist Amde Mekonnen
Yosef Worku Alemneh
Maarten de Rijke
RALM
313
2
0
25 May 2025
Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation
Thomas F Burns
Letitia Parcalabescu
Stephan Wäldchen
Michael Barlow
Gregor Ziegltrum
Volker Stampa
Bastian Harren
Björn Deiseroth
SyDa
498
1
0
24 Apr 2025
Kuwain 1.5B: An Arabic SLM via Language Injection
Khalil Hennara
Sara Chrouf
Mohamed Motaism Hamed
Zeina Aldallal
Omar Hadid
Safwan AlModhayan
282
3
0
21 Apr 2025
Tokenization is Sensitive to Language Variation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Anna Wegmann
Dong Nguyen
David Jurgens
434
5
0
21 Feb 2025
Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment
Somnath Banerjee
Sayan Layek
Pratyush Chatterjee
Animesh Mukherjee
Rima Hazra
LLMSV
378
3
0
16 Feb 2025
Visual-Word Tokenizer: Beyond Fixed Sets of Tokens in Vision Transformers
Leonidas Gee
Wing Yan Li
V. Sharmanska
Novi Quadrianto
ViT
678
0
0
23 Nov 2024
LLäMmlein: Transparent, Compact and Competitive German-Only Language Models from Scratch
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Jan Pfister
Julia Wunderle
Andreas Hotho
550
2
0
17 Nov 2024
ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla
Deeparghya Dutta Barua
Md Sakib Ul Rahman Sourove
Md Fahim
Fabiha Haider
Fariha Tanjim Shifat
Md Tasmim Rahman Adib
Anam Borhan Uddin
Md Farhan Ishmam
Md Farhad Alam
223
1
0
19 Oct 2024
Data Processing for the OpenGPT-X Model Family
Nicolo' Brandizzi
Hammam Abdelwahab
Anirban Bhowmick
Lennard Helmer
Benny Jörg Stein
...
Georg Rehm
Dennis Wegener
Nicolas Flores-Herr
Joachim Kohler
Johannes Leveling
VLM
458
3
0
11 Oct 2024
Large Language Models as Markov Chains
Oussama Zekri
Ambroise Odonnat
Khyati Khandelwal
Linus Bleistein
Nicolas Boullé
I. Redko
423
26
0
03 Oct 2024
Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
Mehdi Ali
Michael Fromm
Klaudia Thellmann
Jan Ebert
Alexander Arno Weber
...
René Jäkel
Georg Rehm
Stefan Kesselheim
Joachim Kohler
Nicolas Flores-Herr
309
13
0
30 Sep 2024
Tokenization for Molecular Foundation Models
Alexius Wadell
Anoushka Bhutani
Venkatasubramanian Viswanathan
1.1K
1
0
19 Sep 2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Pavel Chizhov
Catherine Arnett
Elizaveta Korotkova
Ivan P. Yamshchikov
232
14
0
06 Sep 2024
Bilingual Adaptation of Monolingual Foundation Models
Gurpreet Gosal
Yishi Xu
Gokul Ramakrishnan
Rituraj Joshi
Avraham Sheinin
...
Rahul Pal
Parvez Mullah
Soundar Doraiswamy
Mohamed El Karim Chami
Preslav Nakov
CLL
352
6
0
13 Jul 2024
Training LLMs over Neurally Compressed Text
Brian Lester
Jaehoon Lee
A. Alemi
Jeffrey Pennington
Adam Roberts
Jascha Narain Sohl-Dickstein
Noah Constant
205
11
0
04 Apr 2024
Poro 34B and the Blessing of Multilinguality
Risto Luukkonen
Jonathan Burdge
Elaine Zosa
Aarne Talman
Ville Komulainen
Vaino Hatanpaa
Peter Sarlin
S. Pyysalo
AI4CE
311
18
0
02 Apr 2024
An Improved Traditional Chinese Evaluation Suite for Foundation Model
Zhi Rui Tam
Ya-Ting Pai
Yen-Wei Lee
Jun-Da Chen
Wei-Min Chu
Sega Cheng
Hong-Han Shuai
ELM
483
15
0
04 Mar 2024
NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural
Wilson Wongso
David Samuel Setiawan
Steven Limcorn
Ananto Joyoadikusumo
170
6
0
04 Mar 2024
On the Challenges and Opportunities in Generative AI
Laura Manduchi
Kushagra Pandey
Robert Bamler
Sina Daubener
...
Yixin Wang
F. Wenzel
Frank Wood
Stephan Mandt
Vincent Fortuin
756
40
0
28 Feb 2024