2103.06874
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
Transactions of the Association for Computational Linguistics (TACL), 2021
11 March 2021
J. Clark
Dan Garrette
Iulia Turc
John Wieting
Papers citing
"CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation"
50 / 167 papers shown
Title
UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8
Preston Firestone
Shubham Ugare
Gagandeep Singh
Sasa Misailovic
0
1
0
05 Nov 2025
Languages are Modalities: Cross-Lingual Alignment via Encoder Injection
Rajan Agarwal
Aarush Gupta
36
0
0
31 Oct 2025
Explaining and Mitigating Crosslingual Tokenizer Inequities
Catherine Arnett
T. Chang
Stella Biderman
Benjamin Bergen
40
0
0
24 Oct 2025
From Characters to Tokens: Dynamic Grouping with Hierarchical BPE
Rares Dolga
Lucas Maystre
Tudor Berariu
David Barber
24
0
0
17 Oct 2025
Deconstructing Attention: Investigating Design Principles for Effective Language Modeling
Huiyin Xue
Nafise Sadat Moosavi
Nikolaos Aletras
40
0
0
13 Oct 2025
Quick-CapsNet (QCN): A fast alternative to Capsule Networks
ACS/IEEE International Conference on Computer Systems and Applications (AICCSA), 2020
Pouya Shiri
Ramin Sharifi
A. Baniasadi
3DPC
92
0
0
08 Oct 2025
Towards Data-Efficient Medical Imaging: A Generative and Semi-Supervised Framework
Mosong Ma
Tania Stathaki
Michalis Lazarou
MedIm
GAN
113
0
0
07 Oct 2025
LongTail-Swap: benchmarking language models' abilities on rare words
Robin Algayres
Charles-Éric Saint-James
Mahi Luthra
Jiayi Shen
Dongyan Lin
Youssef Benchekroun
Rashel Moritz
Juan Pino
Emmanuel Dupoux
52
0
0
05 Oct 2025
False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models
Julie Kallini
Dan Jurafsky
Christopher Potts
Martijn Bartelds
73
0
0
23 Sep 2025
chDzDT: Word-level morphology-aware language model for Algerian social media text
Abdelkrime Aries
36
0
0
01 Sep 2025
Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training
Woojin Chung
Jeonghoon Kim
80
0
0
21 Aug 2025
Quo Vadis Handwritten Text Generation for Handwritten Text Recognition?
Vittorio Pippi
Konstantina Nikolaidou
S. Cascianelli
George Retsinas
Giorgos Sfikas
Rita Cucchiara
Marcus Liwicki
DiffM
67
0
0
13 Aug 2025
DeCAL Tokenwise Compression
Sameer Panwar
64
0
0
11 Aug 2025
H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages
Mehrdad Zakershahrak
Samira Ghodratnama
VLM
32
0
0
07 Aug 2025
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization
Negar Foroutan
Clara Meister
Debjit Paul
Joel Niklaus
Sina Ahmadi
Antoine Bosselut
Rico Sennrich
100
2
0
06 Aug 2025
The Art of Breaking Words: Rethinking Multilingual Tokenizer Design
Aamod Thakur
Ajay Nagpal
Atharva Savarkar
Kundeshwar Pundalik
Siddhesh Dosi
Piyush Sawarkar
Viraj Thakur
Rohit Saluja
Maunendra Sankar Desarkar
Ganesh Ramakrishnan
44
1
0
03 Aug 2025
SpeLLM: Character-Level Multi-Head Decoding
Amit Ben Artzy
Roy Schwartz
71
1
0
22 Jul 2025
FLEXITOKENS: Flexible Tokenization for Evolving Language Models
A. Owodunni
Orevaoghene Ahia
Sachin Kumar
130
2
0
17 Jul 2025
Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations
A. Bochkov
134
2
0
07 Jul 2025
Entropy-Driven Pre-Tokenization for Byte-Pair Encoding
Yifan Hu
Frank Liang
Dachuan Zhao
Jonathan Geuter
Varshini Reddy
Craig W. Schmidt
Chris Tanner
136
1
0
18 Jun 2025
One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers
Diana Abagyan
Alejandro Salamanca
Andres Felipe Cruz-Salinas
Kris Cao
Hangyu Lin
Acyr Locatelli
Marzieh Fadaee
Ahmet Üstün
Sara Hooker
CLL
264
3
0
12 Jun 2025
Canonical Autoregressive Generation
Ivi Chatzi
N. C. Benz
Stratis Tsirtsis
Manuel Gomez Rodriguez
83
1
0
06 Jun 2025
StochasTok: Improving Fine-Grained Subword Understanding in LLMs
Anya Sims
Thom Foster
Klara Kaleb
Tuan-Duy H. Nguyen
Joseph Lee
Jakob N. Foerster
Yee Whye Teh
Cong Lu
253
2
0
02 Jun 2025
The State of Large Language Models for African Languages: Progress and Challenges
Kedir Yassin Hussen
W. Sewunetie
Abinew Ali Ayele
Sukairaj Hafiz Imam
Shamsuddeen Hassan Muhammad
Seid Muhie Yimam
217
2
0
02 Jun 2025
Improving Language and Modality Transfer in Translation by Character-level Modeling
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Ioannis Tsiamas
David Dale
Marta R. Costa-jussá
76
2
0
30 May 2025
Multilingual Pretraining for Pixel Language Models
Ilker Kesen
Jonas F. Lotz
Ingo Ziegler
Phillip Rust
Desmond Elliott
MLLM
VLM
217
0
0
27 May 2025
Token-free Models for Sarcasm Detection
Sumit Mamtani
Maitreya Sonawane
Kanika Agarwal
Nishanth Sanjeev
167
0
0
02 May 2025
LogicLearner: A Tool for the Guided Practice of Propositional Logic Proofs
Amogh Inamdar
U. Macar
Michel Vazirani
Michael Tarnow
Zarina Mustapha
Natalia Dittren
Sam Sadeh
Nakul Verma
Ansaf Salleb-Aouissi
LRM
170
0
0
25 Mar 2025
KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications
M. Bommarito
Daniel Martin Katz
Jillian Bommarito
143
3
0
21 Mar 2025
SuperBPE: Space Travel for Language Models
Alisa Liu
J. Hayase
Valentin Hofmann
Sewoong Oh
Noah A. Smith
Yejin Choi
331
22
0
17 Mar 2025
Cross-Lingual IPA Contrastive Learning for Zero-Shot NER
Jimin Sohn
David R. Mortensen
155
0
0
10 Mar 2025
Optimal word order for non-causal text generation with Large Language Models: the Spanish case
Pattern Recognition Letters (Pattern Recogn. Lett.), 2025
Andrea Busto-Castiñeira
Silvia García-Méndez
Francisco de Arriba-Pérez
Francisco J. González Castaño
141
1
0
21 Feb 2025
MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies
Ehsaneddin Asgari
Yassine El Kheir
Mohammad Ali Sadraei Javaheri
232
9
0
02 Feb 2025
BinarySelect to Improve Accessibility of Black-Box Attack Research
International Conference on Computational Linguistics (COLING), 2024
Shatarupa Ghosh
Jonathan Rusert
AAML
244
1
0
13 Dec 2024
Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Zhu Xu
Zhiqiang Zhao
Zihan Zhang
Yuchi Liu
Quanwei Shen
Fei Liu
Yu Kuang
Jian He
Conglin Liu
320
3
0
26 Nov 2024
ASL STEM Wiki: Dataset and Benchmark for Interpreting STEM Articles
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Kayo Yin
Chinmay Singh
Fyodor O. Minakov
Vanessa Milan
Hal Daumé III
Cyril Zhang
Alex X. Lu
Danielle Bragg
106
5
0
08 Nov 2024
MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Langlin Huang
Mengyu Bu
Yang Feng
176
0
0
03 Nov 2024
Morphological Typology in BPE Subword Productivity and Language Modeling
Iñigo Parra
136
0
0
31 Oct 2024
From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes
Zébulon Goriely
Richard Diehl Martinez
Andrew Caines
Lisa Beinborn
P. Buttery
CLL
162
7
0
30 Oct 2024
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
International Conference on Learning Representations (ICLR), 2024
Julie Kallini
Shikhar Murty
Christopher D. Manning
Christopher Potts
Róbert Csordás
280
13
0
28 Oct 2024
From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages
Artur Kiulian
Anton Polishko
M. Khandoga
Yevhen Kostiuk
Guillermo Gabrielli
...
Hrishikesh Garud
Wendy Wing Yee Mak
Dmytro Chaplynskyi
Selma Belhadj Amor
Grigol Peradze
149
0
0
24 Oct 2024
LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems
Nan Xu
Xuezhe Ma
LRM
294
0
0
18 Oct 2024
Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Kushal Tatariya
Vladimir Araujo
Thomas Bauwens
Miryam de Lhoneux
VLM
156
1
0
15 Oct 2024
Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5
Thao Anh Dang
Limor Raviv
Lukas Galke
205
5
0
15 Oct 2024
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
Alex Cloud
Jacob Goldman-Wetzler
Evžen Wybitul
Joseph Miller
Alexander Matt Turner
109
13
0
06 Oct 2024
Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus
Craig Messner
Tom Lippincott
115
1
0
03 Oct 2024
Egalitarian Language Representation in Language Models: It All Begins with Tokenizers
International Conference on Computational Linguistics (COLING), 2024
Menan Velayuthan
Kengatharaiyer Sarveswaran
176
8
0
17 Sep 2024
DiffusionPen: Towards Controlling the Style of Handwritten Text Generation
European Conference on Computer Vision (ECCV), 2024
Konstantina Nikolaidou
George Retsinas
Giorgos Sfikas
Marcus Liwicki
DiffM
186
9
0
09 Sep 2024
Predictability and Causality in Spanish and English Natural Language Generation
IEEE Access (IEEE Access), 2024
Andrea Busto-Castiñeira
Francisco J. González Castaño
Silvia García-Méndez
Francisco de Arriba-Pérez
CML
160
2
0
26 Aug 2024
LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Danlu Chen
Freda Shi
Aditi Agarwal
Jacobo Myerston
Taylor Berg-Kirkpatrick
153
2
0
08 Aug 2024