Language Model Tokenizers Introduce Unfairness Between Languages
Neural Information Processing Systems (NeurIPS), 2023
17 May 2023
Aleksandar Petrov
Emanuele La Malfa
Juil Sock
Adel Bibi

Papers citing "Language Model Tokenizers Introduce Unfairness Between Languages"

50 / 66 papers shown
IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs
Souvik Rana
Arul Menezes
Ashish Kulkarni
Chandra Khatri
Shubham Agarwal
68
0
0
05 Nov 2025
UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8
Preston Firestone
Shubham Ugare
Gagandeep Singh
Sasa Misailovic
84
1
0
05 Nov 2025
Languages are Modalities: Cross-Lingual Alignment via Encoder Injection
Rajan Agarwal
Aarush Gupta
106
0
0
31 Oct 2025
Explaining and Mitigating Crosslingual Tokenizer Inequities
Catherine Arnett
T. Chang
Stella Biderman
Benjamin Bergen
120
0
0
24 Oct 2025
Back to Bytes: Revisiting Tokenization Through UTF-8
Amit Moryossef
Clara Meister
Pavel Stepachev
Desmond Elliott
79
0
0
19 Oct 2025
Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic
Yuval Reif
Guy Kaplan
Roy Schwartz
KELM
137
0
0
19 Oct 2025
Tahakom LLM Guidelines and Recipes: From Pre-training Data to an Arabic LLM
Areej AlOtaibi
Lina Alyahya
Raghad Alshabanah
Shahad Alfawzan
Shuruq Alarefei
...
Waad Alahmed
Omar Talabay
Jalal Alowibdi
Salem Alelyani
Adel Bibi
153
0
0
15 Oct 2025
Metric Calculating Benchmark: Code-Verifiable Complicate Instruction Following Benchmark for Large Language Models
Hyeonseok Moon
Seongtae Hong
Jaehyung Seo
Heuiseok Lim
ALM
108
0
0
09 Oct 2025
Towards Data-Efficient Medical Imaging: A Generative and Semi-Supervised Framework
Mosong Ma
Tania Stathaki
Michalis Lazarou
MedIm
GAN
201
0
0
07 Oct 2025
Auditing Pay-Per-Token in Large Language Models
Ander Artola Velasco
Stratis Tsirtsis
Manuel Gomez Rodriguez
MLAU
197
0
0
05 Oct 2025
The Disparate Impacts of Speculative Decoding
Jameson Sandler
Ahmet Üstün
Marco Romanelli
Sara Hooker
Ferdinando Fioretto
76
0
0
02 Oct 2025
One Model, Many Morals: Uncovering Cross-Linguistic Misalignments in Computational Moral Reasoning
Sualeha Farid
Jayden Lin
Zean Chen
Shivani Kumar
David Jurgens
LRM
108
1
0
25 Sep 2025
Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks
Vani Kanjirangat
Tanja Samardžić
Ljiljana Dolamic
Fabio Rinaldi
56
0
0
24 Sep 2025
False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models
Julie Kallini
Dan Jurafsky
Christopher Potts
Martijn Bartelds
149
0
0
23 Sep 2025
Bhaasha, Bhasa, Zaban: A Survey for Low-Resourced Languages in South Asia - Current Stage and Challenges
Sampoorna Poria
Xiaolei Huang
160
0
0
15 Sep 2025
It's All About In-Context Learning! Teaching Extremely Low-Resource Languages to LLMs
Yue Li
Zhixue Zhao
Carolina Scarton
68
1
0
26 Aug 2025
Long Chain-of-Thought Reasoning Across Languages
Josh Barua
Seun Eisape
Kayo Yin
Alane Suhr
LRM
92
0
0
20 Aug 2025
The Art of Breaking Words: Rethinking Multilingual Tokenizer Design
Aamod Thakur
Ajay Nagpal
Atharva Savarkar
Kundeshwar Pundalik
Siddhesh Dosi
Piyush Sawarkar
Viraj Thakur
Rohit Saluja
Maunendra Sankar Desarkar
Ganesh Ramakrishnan
88
1
0
03 Aug 2025
AI-generated stories favour stability over change: homogeneity and cultural stereotyping in narratives generated by gpt-4o-mini
Open Research Europe (ORE), 2025
Jill Walker Rettberg
Hermann Wigers
98
2
0
30 Jul 2025
SpeLLM: Character-Level Multi-Head Decoding
Amit Ben Artzy
Roy Schwartz
103
1
0
22 Jul 2025
FLEXITOKENS: Flexible Tokenization for Evolving Language Models
A. Owodunni
Orevaoghene Ahia
Sachin Kumar
166
0
0
17 Jul 2025
Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models
Gyeongje Cho
Yeonkyoun So
Chanwoo Park
Sangmin Lee
Sungmok Jung
Jaejin Lee
VLM
166
0
0
18 Jun 2025
One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers
Diana Abagyan
Alejandro Salamanca
Andres Felipe Cruz-Salinas
Kris Cao
Hangyu Lin
Acyr Locatelli
Marzieh Fadaee
Ahmet Üstün
Sara Hooker
CLL
321
3
0
12 Jun 2025
Bit-level BPE: Below the byte boundary
Sangwhan Moon
Tatsuya Hiraoka
Naoaki Okazaki
145
0
0
09 Jun 2025
Beyond Text Compression: Evaluating Tokenizers Across Scales
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jonas F. Lotz
António V. Lopes
Stephan Peitz
Hendra Setiawan
Leonardo Emili
247
2
0
03 Jun 2025
Minimal Pair-Based Evaluation of Code-Switching
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Igor Sterner
Simone Teufel
237
4
0
02 Jun 2025
Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives
Ander Artola Velasco
Stratis Tsirtsis
William Orchard
Manuel Gomez Rodriguez
308
2
0
27 May 2025
BnMMLU: Measuring Massive Multitask Language Understanding in Bengali
Saman Sarker Joy
ELM
139
1
0
25 May 2025
Multilingual Question Answering in Low-Resource Settings: A Dzongkha-English Benchmark for Foundation Models
Md. Tanzib Hosain
Rajan Das Gupta
Md. Kishor Morol
154
2
0
24 May 2025
Crosslingual Reasoning through Test-Time Scaling
Zheng-Xin Yong
Muhammad Farid Adilazuarda
Jonibek Mansurov
Ruochen Zhang
Niklas Muennighoff
Carsten Eickhoff
Genta Indra Winata
Julia Kreutzer
Stephen H. Bach
Alham Fikri Aji
LRM
ELM
925
26
0
08 May 2025
Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance
Nirvan Patil
Malhar Abhay Inamdar
Agnivo Gosai
Guruprasad Pathak
Anish Joshi
Aryan Sagavekar
Anish Joshirao
Raj Abhijit Dandekar
Rajat Dandekar
Sreedath Panat
309
1
0
07 Apr 2025
Large Language Models in Numberland: A Quick Test of Their Numerical Reasoning Abilities
Roussel Rahman
ReLM
ELM
LRM
189
3
0
31 Mar 2025
Adversarial Tokenization
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Renato Lui Geh
Zilei Shao
Karen Ullrich
SILM
AAML
346
5
0
04 Mar 2025
Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting
Fajri Koto
Rituraj Joshi
Nurdaulet Mukhituly
Yanjie Wang
Zhuohan Xie
...
Sarath Chandran
Avraham Sheinin
Natalia Vassilieva
Neha Sengupta
Larry Murray
ALM
KELM
341
3
0
03 Mar 2025
Do Multilingual LLMs Think In English?
Lisa Schut
Y. Gal
Sebastian Farquhar
224
42
0
24 Feb 2025
Tokenization is Sensitive to Language Variation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Anna Wegmann
Dong Nguyen
David Jurgens
370
4
0
21 Feb 2025
DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Ting Sun
Penghan Wang
Fan Lai
244
3
0
17 Feb 2025
Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment
Somnath Banerjee
Sayan Layek
Pratyush Chatterjee
Animesh Mukherjee
Rima Hazra
LLMSV
314
3
0
16 Feb 2025
How well can LLMs Grade Essays in Arabic?
Rayed Ghazawi
Edwin Simpson
179
4
0
27 Jan 2025
When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages
Archchana Sindhujan
Helen Treharne
Constantin Orasan
Shenbin Qian
219
6
0
08 Jan 2025
Visual-Word Tokenizer: Beyond Fixed Sets of Tokens in Vision Transformers
Leonidas Gee
Wing Yan Li
V. Sharmanska
Novi Quadrianto
ViT
581
0
0
23 Nov 2024
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
International Conference on Learning Representations (ICLR), 2024
Julie Kallini
Shikhar Murty
Christopher D. Manning
Christopher Potts
Róbert Csordás
367
14
0
28 Oct 2024
Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models
Eddie L. Ungless
Nikolas Vitsakis
Zeerak Talat
James Garforth
Bjorn Ross
Arno Onken
Atoosa Kasirzadeh
Alexandra Birch
230
3
0
17 Oct 2024
Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Fangru Lin
Shaoguang Mao
Emanuele La Malfa
Valentin Hofmann
Adrian de Wynter
Jing Yao
Si-Qing Chen
Michael Wooldridge
J. Pierrehumbert
Furu Wei
411
3
0
14 Oct 2024
Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?
International Conference on Learning Representations (ICLR), 2024
HyoJung Han
Akiko Eriguchi
Haoran Xu
Hieu T. Hoang
Marine Carpuat
Huda Khayrallah
VLM
207
8
0
12 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs
International Conference on Learning Representations (ICLR), 2024
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
344
28
0
08 Oct 2024
Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
Mehdi Ali
Michael Fromm
Klaudia Thellmann
Jan Ebert
Alexander Arno Weber
...
René Jäkel
Georg Rehm
Stefan Kesselheim
Joachim Kohler
Nicolas Flores-Herr
249
13
0
30 Sep 2024
ExploreSelf: Fostering User-driven Exploration and Reflection on Personal Challenges with Adaptive Guidance by Large Language Models
International Conference on Human Factors in Computing Systems (CHI), 2024
Inhwa Song
SoHyun Park
Sachin R. Pendse
J. Schleider
Munmun De Choudhury
Young-Ho Kim
293
2
0
15 Sep 2024
Where is the signal in tokenization space?
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Renato Lui Geh
Honghua Zhang
Kareem Ahmed
Benjie Wang
Karen Ullrich
245
11
0
16 Aug 2024
Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities
Shaltiel Shmidman
Avi Shmidman
Amir DN Cohen
Moshe Koppel
161
9
0
09 Jul 2024