2305.15425
Cited By
Language Model Tokenizers Introduce Unfairness Between Languages
Neural Information Processing Systems (NeurIPS), 2023
17 May 2023
Aleksandar Petrov
Emanuele La Malfa
Juil Sock
Adel Bibi
Papers citing
"Language Model Tokenizers Introduce Unfairness Between Languages"
50 / 68 papers shown
Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models
Taido Purason
Pavel Chizhov
Ivan P. Yamshchikov
Mark Fishel
CLL
VLM
138
0
0
03 Dec 2025
Social Perceptions of English Spelling Variation on Twitter: A Comparative Analysis of Human and LLM Responses
Dong Nguyen
Laura Rosseel
69
0
0
28 Nov 2025
UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8
Preston Firestone
Shubham Ugare
Gagandeep Singh
Sasa Misailovic
96
1
0
05 Nov 2025
IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs
Souvik Rana
Arul Menezes
Ashish Kulkarni
Chandra Khatri
Shubham Agarwal
115
1
0
05 Nov 2025
Languages are Modalities: Cross-Lingual Alignment via Encoder Injection
Rajan Agarwal
Aarush Gupta
130
0
0
31 Oct 2025
Explaining and Mitigating Crosslingual Tokenizer Inequities
Catherine Arnett
T. Chang
Stella Biderman
Benjamin Bergen
163
1
0
24 Oct 2025
Back to Bytes: Revisiting Tokenization Through UTF-8
Amit Moryossef
Clara Meister
Pavel Stepachev
Desmond Elliott
127
0
0
19 Oct 2025
Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic
Yuval Reif
Guy Kaplan
Roy Schwartz
KELM
170
0
0
19 Oct 2025
Tahakom LLM Guidelines and Recipes: From Pre-training Data to an Arabic LLM
Areej AlOtaibi
Lina Alyahya
Raghad Alshabanah
Shahad Alfawzan
Shuruq Alarefei
...
Waad Alahmed
Omar Talabay
Jalal Alowibdi
Salem Alelyani
Adel Bibi
193
0
0
15 Oct 2025
Metric Calculating Benchmark: Code-Verifiable Complicate Instruction Following Benchmark for Large Language Models
Hyeonseok Moon
Seongtae Hong
Jaehyung Seo
Heuiseok Lim
ALM
151
0
0
09 Oct 2025
Towards Data-Efficient Medical Imaging: A Generative and Semi-Supervised Framework
Mosong Ma
Tania Stathaki
Michalis Lazarou
MedIm
GAN
251
0
0
07 Oct 2025
Auditing Pay-Per-Token in Large Language Models
Ander Artola Velasco
Stratis Tsirtsis
Manuel Gomez Rodriguez
MLAU
229
0
0
05 Oct 2025
The Disparate Impacts of Speculative Decoding
Jameson Sandler
Ahmet Üstün
Marco Romanelli
Sara Hooker
Ferdinando Fioretto
120
1
0
02 Oct 2025
One Model, Many Morals: Uncovering Cross-Linguistic Misalignments in Computational Moral Reasoning
Sualeha Farid
Jayden Lin
Zean Chen
Shivani Kumar
David Jurgens
LRM
140
1
0
25 Sep 2025
Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks
Vani Kanjirangat
Tanja Samardžić
Ljiljana Dolamic
Fabio Rinaldi
84
1
0
24 Sep 2025
False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models
Julie Kallini
Dan Jurafsky
Christopher Potts
Martijn Bartelds
185
0
0
23 Sep 2025
Bhaasha, Bhasa, Zaban: A Survey for Low-Resourced Languages in South Asia - Current Stage and Challenges
Sampoorna Poria
Xiaolei Huang
200
0
0
15 Sep 2025
It's All About In-Context Learning! Teaching Extremely Low-Resource Languages to LLMs
Yue Li
Zhixue Zhao
Carolina Scarton
141
3
0
26 Aug 2025
Long Chain-of-Thought Reasoning Across Languages
Josh Barua
Seun Eisape
Kayo Yin
Alane Suhr
LRM
156
1
0
20 Aug 2025
The Art of Breaking Words: Rethinking Multilingual Tokenizer Design
Aamod Thakur
Ajay Nagpal
Atharva Savarkar
Kundeshwar Pundalik
Siddhesh Dosi
Piyush Sawarkar
Viraj Thakur
Rohit Saluja
Maunendra Sankar Desarkar
Ganesh Ramakrishnan
104
1
0
03 Aug 2025
AI-generated stories favour stability over change: homogeneity and cultural stereotyping in narratives generated by gpt-4o-mini
Open Research Europe (ORE), 2025
Jill Walker Rettberg
Hermann Wigers
141
2
0
30 Jul 2025
SpeLLM: Character-Level Multi-Head Decoding
Amit Ben Artzy
Roy Schwartz
139
1
0
22 Jul 2025
FLEXITOKENS: Flexible Tokenization for Evolving Language Models
A. Owodunni
Orevaoghene Ahia
Sachin Kumar
217
0
0
17 Jul 2025
Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models
Gyeongje Cho
Yeonkyoun So
Chanwoo Park
Sangmin Lee
Sungmok Jung
Jaejin Lee
VLM
220
0
0
18 Jun 2025
One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers
Diana Abagyan
Alejandro Salamanca
Andres Felipe Cruz-Salinas
Kris Cao
Hangyu Lin
Acyr Locatelli
Marzieh Fadaee
Ahmet Üstün
Sara Hooker
CLL
373
4
0
12 Jun 2025
Bit-level BPE: Below the byte boundary
Sangwhan Moon
Tatsuya Hiraoka
Naoaki Okazaki
173
1
0
09 Jun 2025
Beyond Text Compression: Evaluating Tokenizers Across Scales
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jonas F. Lotz
António V. Lopes
Stephan Peitz
Hendra Setiawan
Leonardo Emili
276
2
0
03 Jun 2025
Minimal Pair-Based Evaluation of Code-Switching
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Igor Sterner
Simone Teufel
297
5
0
02 Jun 2025
Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives
Ander Artola Velasco
Stratis Tsirtsis
William Orchard
Manuel Gomez Rodriguez
381
3
0
27 May 2025
BnMMLU: Measuring Massive Multitask Language Understanding in Bengali
Saman Sarker Joy
Swakkhar Shatabda
ELM
184
1
0
25 May 2025
Multilingual Question Answering in Low-Resource Settings: A Dzongkha-English Benchmark for Foundation Models
Md. Tanzib Hosain
Rajan Das Gupta
Md. Kishor Morol
206
3
0
24 May 2025
Crosslingual Reasoning through Test-Time Scaling
Zheng-Xin Yong
Muhammad Farid Adilazuarda
Jonibek Mansurov
Ruochen Zhang
Niklas Muennighoff
Carsten Eickhoff
Genta Indra Winata
Julia Kreutzer
Stephen H. Bach
Alham Fikri Aji
LRM
ELM
969
28
0
08 May 2025
Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance
Nirvan Patil
Malhar Abhay Inamdar
Agnivo Gosai
Guruprasad Pathak
Anish Joshi
Aryan Sagavekar
Anish Joshirao
Raj Abhijit Dandekar
Rajat Dandekar
Sreedath Panat
353
1
0
07 Apr 2025
Large Language Models in Numberland: A Quick Test of Their Numerical Reasoning Abilities
Roussel Rahman
ReLM
ELM
LRM
252
3
0
31 Mar 2025
Adversarial Tokenization
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Renato Lui Geh
Zilei Shao
Karen Ullrich
SILM
AAML
439
6
0
04 Mar 2025
Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting
Fajri Koto
Rituraj Joshi
Nurdaulet Mukhituly
Yanjie Wang
Zhuohan Xie
...
Sarath Chandran
Avraham Sheinin
Natalia Vassilieva
Neha Sengupta
Larry Murray
ALM
KELM
404
3
0
03 Mar 2025
Do Multilingual LLMs Think In English?
Lisa Schut
Y. Gal
Sebastian Farquhar
293
45
0
24 Feb 2025
Tokenization is Sensitive to Language Variation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Anna Wegmann
Dong Nguyen
David Jurgens
434
5
0
21 Feb 2025
DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Ting Sun
Penghan Wang
Fan Lai
322
3
0
17 Feb 2025
Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment
Somnath Banerjee
Sayan Layek
Pratyush Chatterjee
Animesh Mukherjee
Rima Hazra
LLMSV
378
3
0
16 Feb 2025
How well can LLMs Grade Essays in Arabic?
Rayed Ghazawi
Edwin Simpson
215
5
0
27 Jan 2025
When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages
Archchana Sindhujan
Helen Treharne
Constantin Orasan
Shenbin Qian
252
6
0
08 Jan 2025
Visual-Word Tokenizer: Beyond Fixed Sets of Tokens in Vision Transformers
Leonidas Gee
Wing Yan Li
V. Sharmanska
Novi Quadrianto
ViT
681
0
0
23 Nov 2024
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
International Conference on Learning Representations (ICLR), 2024
Julie Kallini
Shikhar Murty
Christopher D. Manning
Christopher Potts
Róbert Csordás
416
14
0
28 Oct 2024
Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models
Eddie L. Ungless
Nikolas Vitsakis
Zeerak Talat
James Garforth
Bjorn Ross
Arno Onken
Atoosa Kasirzadeh
Alexandra Birch
262
3
0
17 Oct 2024
Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Fangru Lin
Shaoguang Mao
Emanuele La Malfa
Valentin Hofmann
Adrian de Wynter
Jing Yao
Si-Qing Chen
Michael Wooldridge
J. Pierrehumbert
Furu Wei
514
3
0
14 Oct 2024
Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?
International Conference on Learning Representations (ICLR), 2024
HyoJung Han
Akiko Eriguchi
Haoran Xu
Hieu T. Hoang
Marine Carpuat
Huda Khayrallah
VLM
238
8
0
12 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs
International Conference on Learning Representations (ICLR), 2024
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
439
30
0
08 Oct 2024
Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
Mehdi Ali
Michael Fromm
Klaudia Thellmann
Jan Ebert
Alexander Arno Weber
...
René Jäkel
Georg Rehm
Stefan Kesselheim
Joachim Kohler
Nicolas Flores-Herr
313
14
0
30 Sep 2024
ExploreSelf: Fostering User-driven Exploration and Reflection on Personal Challenges with Adaptive Guidance by Large Language Models
International Conference on Human Factors in Computing Systems (CHI), 2024
Inhwa Song
SoHyun Park
Sachin R. Pendse
J. Schleider
Munmun De Choudhury
Young-Ho Kim
383
3
0
15 Sep 2024