Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

20 December 2021
Sabrina J. Mielke
Zaid Alyafeai
Elizabeth Salesky
Colin Raffel
Manan Dey
Matthias Gallé
Arun Raja
Chenglei Si
Wilson Y. Lee
Benoît Sagot
Samson Tan

Papers citing "Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP"

22 / 22 papers shown
Modes of Sequence Models and Learning Coefficients
Zhongtian Chen
Daniel Murfet
68
1
0
25 Apr 2025
MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression
Noel Elias
H. Esfahanizadeh
Kaan Kale
S. Vishwanath
Muriel Médard
19
0
0
28 Oct 2024
Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models
Alexius Wadell
Anoushka Bhutani
Venkatasubramanian Viswanathan
20
0
0
19 Sep 2024
Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts
Yingfa Chen
Chenlong Hu
Cong Feng
Chenyang Song
Shi Yu
Xu Han
Zhiyuan Liu
Maosong Sun
16
0
0
02 Sep 2024
An Analysis of BPE Vocabulary Trimming in Neural Machine Translation
Marco Cognetta
Tatsuya Hiraoka
Naoaki Okazaki
Rico Sennrich
Yuval Pinter
19
2
0
30 Mar 2024
Subobject-level Image Tokenization
Delong Chen
Samuel Cahyawijaya
Jianfeng Liu
Baoyuan Wang
Pascale Fung
VLM
OCL
38
6
0
22 Feb 2024
Analyzing Cognitive Plausibility of Subword Tokenization
Lisa Beinborn
Yuval Pinter
11
17
0
20 Oct 2023
CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code
Nadezhda Chirkova
Sergey Troshin
8
8
0
01 Aug 2023
Language Model Tokenizers Introduce Unfairness Between Languages
Aleksandar Petrov
Emanuele La Malfa
Philip H. S. Torr
Adel Bibi
14
96
0
17 May 2023
What is the best recipe for character-level encoder-only modelling?
Kris Cao
12
2
0
09 May 2023
Computational modeling of semantic change
Nina Tahmasebi
Haim Dubossarsky
26
6
0
13 Apr 2023
Word-order typology in Multilingual BERT: A case study in subordinate-clause detection
Dmitry Nikolaev
Sebastian Padó
14
6
0
24 May 2022
Why don't people use character-level machine translation?
Jindřich Libovický
Helmut Schmid
Alexander M. Fraser
59
28
0
15 Oct 2021
How BPE Affects Memorization in Transformers
Eugene Kharitonov
Marco Baroni
Dieuwke Hupkes
155
31
0
06 Oct 2021
Integrating Approaches to Word Representation
Yuval Pinter
NAI
32
5
0
10 Sep 2021
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
Phillip Rust
Jonas Pfeiffer
Ivan Vulić
Sebastian Ruder
Iryna Gurevych
69
235
0
31 Dec 2020
Improving Multilingual Models with Language-Clustered Vocabularies
Hyung Won Chung
Dan Garrette
Kiat Chuan Tan
Jason Riesa
VLM
55
56
0
24 Oct 2020
Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality
Gustavo Aguilar
Bryan McCann
Tong Niu
Nazneen Rajani
N. Keskar
Thamar Solorio
17
11
0
24 Oct 2020
Towards End-to-End In-Image Neural Machine Translation
Elman Mansimov
Mitchell Stern
M. Chen
Orhan Firat
Jakob Uszkoreit
Puneet Jain
14
22
0
20 Oct 2020
CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters
Hicham El Boukkouri
Olivier Ferret
Thomas Lavergne
Hiroshi Noji
Pierre Zweigenbaum
Junichi Tsujii
63
155
0
20 Oct 2020
Word Shape Matters: Robust Machine Translation with Visual Embedding
Haohan Wang
Peiyan Zhang
Eric P. Xing
119
13
0
20 Oct 2020
Morfessor EM+Prune: Improved Subword Segmentation with Expectation Maximization and Pruning
Stig-Arne Grönroos
Sami Virpioja
M. Kurimo
VLM
11
21
0
06 Mar 2020