
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Transactions of the Association for Computational Linguistics (TACL), 2021

11 March 2021


Papers citing "CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation"

50 / 167 papers shown

Title
UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8 Preston Firestone Shubham Ugare Gagandeep Singh Sasa Misailovic 0 1 0 05 Nov 2025
Languages are Modalities: Cross-Lingual Alignment via Encoder Injection Rajan Agarwal Aarush Gupta 36 0 0 31 Oct 2025
Explaining and Mitigating Crosslingual Tokenizer Inequities Catherine Arnett T. Chang Stella Biderman Benjamin Bergen 40 0 0 24 Oct 2025
From Characters to Tokens: Dynamic Grouping with Hierarchical BPE Rares Dolga Lucas Maystre Tudor Berariu David Barber 24 0 0 17 Oct 2025
Deconstructing Attention: Investigating Design Principles for Effective Language Modeling Huiyin Xue Nafise Sadat Moosavi Nikolaos Aletras 40 0 0 13 Oct 2025
Quick-CapsNet (QCN): A fast alternative to Capsule Networks. ACS/IEEE International Conference on Computer Systems and Applications (AICCSA), 2020 Pouya Shiri Ramin Sharifi A. Baniasadi 3DPC 92 0 0 08 Oct 2025
Towards Data-Efficient Medical Imaging: A Generative and Semi-Supervised Framework Mosong Ma Tania Stathaki Michalis Lazarou MedIm GAN 113 0 0 07 Oct 2025
LongTail-Swap: benchmarking language models' abilities on rare words Robin Algayres Charles-Éric Saint-James Mahi Luthra Jiayi Shen Dongyan Lin Youssef Benchekroun Rashel Moritz Juan Pino Emmanuel Dupoux 52 0 0 05 Oct 2025
False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models Julie Kallini Dan Jurafsky Christopher Potts Martijn Bartelds 73 0 0 23 Sep 2025
chDzDT: Word-level morphology-aware language model for Algerian social media text Abdelkrime Aries 36 0 0 01 Sep 2025
Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training Woojin Chung Jeonghoon Kim 80 0 0 21 Aug 2025
Quo Vadis Handwritten Text Generation for Handwritten Text Recognition? Vittorio Pippi Konstantina Nikolaidou S. Cascianelli George Retsinas Giorgos Sfikas Rita Cucchiara Marcus Liwicki DiffM 67 0 0 13 Aug 2025
DeCAL Tokenwise Compression Sameer Panwar 64 0 0 11 Aug 2025
H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages Mehrdad Zakershahrak Samira Ghodratnama VLM 32 0 0 07 Aug 2025
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization Negar Foroutan Clara Meister Debjit Paul Joel Niklaus Sina Ahmadi Antoine Bosselut Rico Sennrich 100 2 0 06 Aug 2025
The Art of Breaking Words: Rethinking Multilingual Tokenizer Design Aamod Thakur Ajay Nagpal Atharva Savarkar Kundeshwar Pundalik Siddhesh Dosi Piyush Sawarkar Viraj Thakur Rohit Saluja Maunendra Sankar Desarkar Ganesh Ramakrishnan 44 1 0 03 Aug 2025
SpeLLM: Character-Level Multi-Head Decoding Amit Ben Artzy Roy Schwartz 71 1 0 22 Jul 2025
FLEXITOKENS: Flexible Tokenization for Evolving Language Models A. Owodunni Orevaoghene Ahia Sachin Kumar 130 2 0 17 Jul 2025
Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations A. Bochkov 134 2 0 07 Jul 2025
Entropy-Driven Pre-Tokenization for Byte-Pair Encoding Yifan Hu Frank Liang Dachuan Zhao Jonathan Geuter Varshini Reddy Craig W. Schmidt Chris Tanner 136 1 0 18 Jun 2025
One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers Diana Abagyan Alejandro Salamanca Andres Felipe Cruz-Salinas Kris Cao Hangyu Lin Acyr Locatelli Marzieh Fadaee Ahmet Üstün Sara Hooker CLL 264 3 0 12 Jun 2025
Canonical Autoregressive Generation Ivi Chatzi N. C. Benz Stratis Tsirtsis Manuel Gomez Rodriguez 83 1 0 06 Jun 2025
StochasTok: Improving Fine-Grained Subword Understanding in LLMs Anya Sims Thom Foster Klara Kaleb Tuan-Duy H. Nguyen Joseph Lee Jakob N. Foerster Yee Whye Teh Cong Lu 253 2 0 02 Jun 2025
The State of Large Language Models for African Languages: Progress and Challenges Kedir Yassin Hussen W. Sewunetie Abinew Ali Ayele Sukairaj Hafiz Imam Shamsuddeen Hassan Muhammad Seid Muhie Yimam 217 2 0 02 Jun 2025
Improving Language and Modality Transfer in Translation by Character-level Modeling. Annual Meeting of the Association for Computational Linguistics (ACL), 2025 Ioannis Tsiamas David Dale Marta R. Costa-jussá 76 2 0 30 May 2025
Multilingual Pretraining for Pixel Language Models Ilker Kesen Jonas F. Lotz Ingo Ziegler Phillip Rust Desmond Elliott MLLM VLM 217 0 0 27 May 2025
Token-free Models for Sarcasm Detection Sumit Mamtani Maitreya Sonawane Kanika Agarwal Nishanth Sanjeev 167 0 0 02 May 2025
LogicLearner: A Tool for the Guided Practice of Propositional Logic Proofs Amogh Inamdar U. Macar Michel Vazirani Michael Tarnow Zarina Mustapha Natalia Dittren Sam Sadeh Nakul Verma Ansaf Salleb-Aouissi LRM 170 0 0 25 Mar 2025
KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications M. Bommarito Daniel Martin Katz Jillian Bommarito 143 3 0 21 Mar 2025
SuperBPE: Space Travel for Language Models Alisa Liu J. Hayase Valentin Hofmann Sewoong Oh Noah A. Smith Yejin Choi 331 22 0 17 Mar 2025
Cross-Lingual IPA Contrastive Learning for Zero-Shot NER Jimin Sohn David R. Mortensen 155 0 0 10 Mar 2025
Optimal word order for non-causal text generation with Large Language Models: the Spanish case. Pattern Recognition Letters (Pattern Recogn. Lett.), 2025 Andrea Busto-Castiñeira Silvia García-Méndez Francisco de Arriba-Pérez Francisco J. González Castaño 141 1 0 21 Feb 2025
MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies Ehsaneddin Asgari Yassine El Kheir Mohammad Ali Sadraei Javaheri 232 9 0 02 Feb 2025
BinarySelect to Improve Accessibility of Black-Box Attack Research. International Conference on Computational Linguistics (COLING), 2024 Shatarupa Ghosh Jonathan Rusert AAML 244 1 0 13 Dec 2024
Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning. Annual Meeting of the Association for Computational Linguistics (ACL), 2024 Zhu Xu Zhiqiang Zhao Zihan Zhang Yuchi Liu Quanwei Shen Fei Liu Yu Kuang Jian He Conglin Liu 320 3 0 26 Nov 2024
ASL STEM Wiki: Dataset and Benchmark for Interpreting STEM Articles. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024 Kayo Yin Chinmay Singh Fyodor O. Minakov Vanessa Milan Hal Daumé III Cyril Zhang Alex X. Lu Danielle Bragg 106 5 0 08 Nov 2024
MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation. North American Chapter of the Association for Computational Linguistics (NAACL), 2024 Langlin Huang Mengyu Bu Yang Feng 176 0 0 03 Nov 2024
Morphological Typology in BPE Subword Productivity and Language Modeling Iñigo Parra 136 0 0 31 Oct 2024
From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes Zébulon Goriely Richard Diehl Martinez Andrew Caines Lisa Beinborn P. Buttery CLL 162 7 0 30 Oct 2024
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models. International Conference on Learning Representations (ICLR), 2024 Julie Kallini Shikhar Murty Christopher D. Manning Christopher Potts Róbert Csordás 280 13 0 28 Oct 2024
From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages Artur Kiulian Anton Polishko M. Khandoga Yevhen Kostiuk Guillermo Gabrielli ... Hrishikesh Garud Wendy Wing Yee Mak Dmytro Chaplynskyi Selma Belhadj Amor Grigol Peradze 149 0 0 24 Oct 2024
LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems Nan Xu Xuezhe Ma LRM 294 0 0 18 Oct 2024
Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024 Kushal Tatariya Vladimir Araujo Thomas Bauwens Miryam de Lhoneux VLM 156 1 0 15 Oct 2024
Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5 Thao Anh Dang Limor Raviv Lukas Galke 205 5 0 15 Oct 2024
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks Alex Cloud Jacob Goldman-Wetzler Evžen Wybitul Joseph Miller Alexander Matt Turner 109 13 0 06 Oct 2024
Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus Craig Messner Tom Lippincott 115 1 0 03 Oct 2024
Egalitarian Language Representation in Language Models: It All Begins with Tokenizers. International Conference on Computational Linguistics (COLING), 2024 Menan Velayuthan Kengatharaiyer Sarveswaran 176 8 0 17 Sep 2024
DiffusionPen: Towards Controlling the Style of Handwritten Text Generation. European Conference on Computer Vision (ECCV), 2024 Konstantina Nikolaidou George Retsinas Giorgos Sfikas Marcus Liwicki DiffM 186 9 0 09 Sep 2024
Predictability and Causality in Spanish and English Natural Language Generation. IEEE Access, 2024 Andrea Busto-Castiñeira Francisco J. González Castaño Silvia García-Méndez Francisco de Arriba-Pérez CML 160 2 0 26 Aug 2024
LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP. Annual Meeting of the Association for Computational Linguistics (ACL), 2024 Danlu Chen Freda Shi Aditi Agarwal Jacobo Myerston Taylor Berg-Kirkpatrick 153 2 0 08 Aug 2024