2112.10508
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP
20 December 2021
Sabrina J. Mielke
Zaid Alyafeai
Elizabeth Salesky
Colin Raffel
Manan Dey
Matthias Gallé
Arun Raja
Chenglei Si
Wilson Y. Lee
Benoît Sagot
Samson Tan
Papers citing "Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP"
17 / 17 papers shown
Title
Modes of Sequence Models and Learning Coefficients
Zhongtian Chen
Daniel Murfet
62
1
0
25 Apr 2025
MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression
Noel Elias
H. Esfahanizadeh
Kaan Kale
S. Vishwanath
Muriel Médard
19
0
0
28 Oct 2024
Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models
Alexius Wadell
Anoushka Bhutani
Venkatasubramanian Viswanathan
20
0
0
19 Sep 2024
Subobject-level Image Tokenization
Delong Chen
Samuel Cahyawijaya
Jianfeng Liu
Baoyuan Wang
Pascale Fung
VLM
OCL
38
6
0
22 Feb 2024
Analyzing Cognitive Plausibility of Subword Tokenization
Lisa Beinborn
Yuval Pinter
9
17
0
20 Oct 2023
What is the best recipe for character-level encoder-only modelling?
Kris Cao
12
2
0
09 May 2023
Computational modeling of semantic change
Nina Tahmasebi
Haim Dubossarsky
23
6
0
13 Apr 2023
Why don't people use character-level machine translation?
Jindřich Libovický
Helmut Schmid
Alexander M. Fraser
59
28
0
15 Oct 2021
How BPE Affects Memorization in Transformers
Eugene Kharitonov
Marco Baroni
Dieuwke Hupkes
155
31
0
06 Oct 2021
Integrating Approaches to Word Representation
Yuval Pinter
NAI
30
5
0
10 Sep 2021
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
Phillip Rust
Jonas Pfeiffer
Ivan Vulić
Sebastian Ruder
Iryna Gurevych
69
235
0
31 Dec 2020
Improving Multilingual Models with Language-Clustered Vocabularies
Hyung Won Chung
Dan Garrette
Kiat Chuan Tan
Jason Riesa
VLM
52
56
0
24 Oct 2020
Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality
Gustavo Aguilar
Bryan McCann
Tong Niu
Nazneen Rajani
N. Keskar
Thamar Solorio
15
11
0
24 Oct 2020
Towards End-to-End In-Image Neural Machine Translation
Elman Mansimov
Mitchell Stern
M. Chen
Orhan Firat
Jakob Uszkoreit
Puneet Jain
14
22
0
20 Oct 2020
CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters
Hicham El Boukkouri
Olivier Ferret
Thomas Lavergne
Hiroshi Noji
Pierre Zweigenbaum
Junichi Tsujii
63
155
0
20 Oct 2020
Word Shape Matters: Robust Machine Translation with Visual Embedding
Haohan Wang
Peiyan Zhang
Eric P. Xing
119
13
0
20 Oct 2020
Morfessor EM+Prune: Improved Subword Segmentation with Expectation Maximization and Pruning
Stig-Arne Grönroos
Sami Virpioja
M. Kurimo
VLM
11
21
0
06 Mar 2020