Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies

19 December 2023

Papers citing "Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies"

Title | Authors | Tags | Counts | Date
Agree to Disagree? A Meta-Evaluation of LLM Misgendering | Arjun Subramonian, Vagrant Gautam, Preethi Seshadri, Dietrich Klakow, Kai-Wei Chang, Yizhou Sun | - | 27 1 0 | 23 Apr 2025
A Multilingual, Culture-First Approach to Addressing Misgendering in LLM Applications | Sunayana Sitaram, Adrian de Wynter, Isobel McCrum, Qilong Gu, Si-Qing Chen | AILaw | 104 0 0 | 26 Mar 2025
Adversarial Tokenization | Renato Lui Geh, Zilei Shao, Guy Van den Broeck | SILM, AAML | 87 0 0 | 04 Mar 2025
Robust Bias Detection in MLMs and its Application to Human Trait Ratings | Ingroj Shrestha, Louis Tay, Padmini Srinivasan | - | 78 0 0 | 24 Feb 2025
Where is the signal in tokenization space? | Renato Lui Geh, Honghua Zhang, Kareem Ahmed, Benjie Wang, Guy Van den Broeck | - | 25 4 0 | 16 Aug 2024
Robust Pronoun Fidelity with English LLMs: Are they Reasoning, Repeating, or Just Biased? | Vagrant Gautam, Eileen Bingert, D. Zhu, Anne Lauscher, Dietrich Klakow | - | 43 8 0 | 04 Apr 2024
Greed is All You Need: An Evaluation of Tokenizer Inference Methods | Omri Uzan, Craig W. Schmidt, Chris Tanner, Yuval Pinter | - | 38 14 0 | 02 Mar 2024
The Pile: An 800GB Dataset of Diverse Text for Language Modeling | Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy | AIMat | 248 1,986 0 | 31 Dec 2020
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models | Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, Iryna Gurevych | - | 69 235 0 | 31 Dec 2020