2305.13707
Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models
23 May 2023
Orevaoghene Ahia
Sachin Kumar
Hila Gonen
Jungo Kasai
David R. Mortensen
Noah A. Smith
Yulia Tsvetkov
Papers citing
"Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models"
23 / 73 papers shown
On the Challenges and Opportunities in Generative AI
Laura Manduchi
Kushagra Pandey
Robert Bamler
Ryan Cotterell
Sina Daubener
...
F. Wenzel
Frank Wood
Stephan Mandt
Vincent Fortuin
54
17
0
28 Feb 2024
An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference
Atsuki Yamaguchi
Aline Villavicencio
Nikolaos Aletras
19
6
0
16 Feb 2024
Natural Language Processing for Dialects of a Language: A Survey
Aditya Joshi
Raj Dabre
Diptesh Kanojia
Zhuang Li
Haolan Zhan
Gholamreza Haffari
Doris Dippold
LM&MA
26
27
0
11 Jan 2024
A Material Lens on Coloniality in NLP
William B. Held
Camille Harris
Michael Best
Diyi Yang
12
11
0
14 Nov 2023
Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali
Michael Fromm
Klaudia Thellmann
Richard Rutmann
Max Lübbering
...
Malte Ostendorff
Samuel Weinbach
R. Sifa
Stefan Kesselheim
Nicolas Flores-Herr
13
47
0
12 Oct 2023
Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond
Siyang Liu
Naihao Deng
Sahand Sabour
Yilin Jia
Minlie Huang
Rada Mihalcea
20
18
0
09 Oct 2023
Language Models as a Service: Overview of a New Paradigm and its Challenges
Emanuele La Malfa
Aleksandar Petrov
Simon Frieder
Christoph Weinhuber
Ryan Burnell
Raza Nazar
Anthony Cohn
Nigel Shadbolt
Michael Wooldridge
ALM
ELM
27
3
0
28 Sep 2023
GlotScript: A Resource and Tool for Low Resource Writing System Identification
Amir Hossein Kargaran
François Yvon
Hinrich Schütze
11
10
0
23 Sep 2023
ChatGPT MT: Competitive for High- (but not Low-) Resource Languages
Nathaniel R. Robinson
Perez Ogayo
David R. Mortensen
Graham Neubig
11
29
0
14 Sep 2023
GlobalBench: A Benchmark for Global Progress in Natural Language Processing
Yueqi Song
Catherine Cui
Simran Khanuja
Pengfei Liu
Fahim Faisal
...
Alham Fikri Aji
Samuel Cahyawijaya
Yulia Tsvetkov
Antonios Anastasopoulos
Graham Neubig
12
7
0
24 May 2023
Multilingual Pixel Representations for Translation and Effective Cross-lingual Transfer
Elizabeth Salesky
Neha Verma
Philipp Koehn
Matt Post
11
14
0
23 May 2023
Language Model Tokenizers Introduce Unfairness Between Languages
Aleksandar Petrov
Emanuele La Malfa
Philip H. S. Torr
Adel Bibi
16
96
0
17 May 2023
The Elephant in the Room: Analyzing the Presence of Big Tech in Natural Language Processing Research
Mohamed Abdalla
Jan Philip Wahle
Terry Ruas
Aurélie Névéol
Fanny Ducel
Saif M. Mohammad
Karën Fort
69
27
0
04 May 2023
MEGA: Multilingual Evaluation of Generative AI
Kabir Ahuja
Harshita Diddee
Rishav Hada
Millicent Ochieng
Krithika Ramesh
...
T. Ganu
Sameer Segal
Maxamed Axmed
Kalika Bali
Sunayana Sitaram
LM&MA
LRM
ELM
13
262
0
22 Mar 2023
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
Evaluating Persian Tokenizers
Danial Kamali
Behrooz Janfada
Mohammad Ebrahim Shenasa
B. Minaei-Bidgoli
6
1
0
22 Feb 2022
Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
Suchin Gururangan
Dallas Card
Sarah K. Drier
E. K. Gade
Leroy Z. Wang
Zeyu Wang
Luke Zettlemoyer
Noah A. Smith
160
72
0
25 Jan 2022
Low Frequency Names Exhibit Bias and Overfitting in Contextualizing Language Models
Robert Wolfe
Aylin Caliskan
82
51
0
01 Oct 2021
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Ofir Press
Noah A. Smith
M. Lewis
237
690
0
27 Aug 2021
Disembodied Machine Learning: On the Illusion of Objectivity in NLP
Zeerak Talat
Smarika Lulz
Joachim Bingel
Isabelle Augenstein
88
51
0
28 Jan 2021
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
Phillip Rust
Jonas Pfeiffer
Ivan Vulić
Sebastian Ruder
Iryna Gurevych
69
235
0
31 Dec 2020
When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models
Benjamin Muller
Antonis Anastasopoulos
Benoît Sagot
Djamé Seddah
LRM
109
165
0
24 Oct 2020
Improving Multilingual Models with Language-Clustered Vocabularies
Hyung Won Chung
Dan Garrette
Kiat Chuan Tan
Jason Riesa
VLM
58
65
0
24 Oct 2020