arXiv:2103.06874
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
11 March 2021
J. Clark
Dan Garrette
Iulia Turc
John Wieting
Papers citing
"CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation"
50 / 143 papers shown
Token-free Models for Sarcasm Detection
Sumit Mamtani
Maitreya Sonawane
Kanika Agarwal
Nishanth Sanjeev
36
0
0
02 May 2025
LogicLearner: A Tool for the Guided Practice of Propositional Logic Proofs
Amogh Inamdar
U. Macar
Michel Vazirani
Michael Tarnow
Zarina Mustapha
Natalia Dittren
Sam Sadeh
Nakul Verma
Ansaf Salleb-Aouissi
LRM
35
0
0
25 Mar 2025
KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications
M. Bommarito
Daniel Martin Katz
Jillian Bommarito
34
1
0
21 Mar 2025
SuperBPE: Space Travel for Language Models
Alisa Liu
J. Hayase
Valentin Hofmann
Sewoong Oh
Noah A. Smith
Yejin Choi
43
1
0
17 Mar 2025
Cross-Lingual IPA Contrastive Learning for Zero-Shot NER
Jimin Sohn
David R. Mortensen
47
0
0
10 Mar 2025
Optimal word order for non-causal text generation with Large Language Models: the Spanish case
Andrea Busto-Castiñeira
Silvia García-Méndez
Francisco de Arriba-Pérez
Francisco J. González Castaño
36
0
0
21 Feb 2025
MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies
Ehsaneddin Asgari
Yassine El Kheir
Mohammad Ali Sadraei Javaheri
53
0
0
02 Feb 2025
BinarySelect to Improve Accessibility of Black-Box Attack Research
Shatarupa Ghosh
Jonathan Rusert
AAML
72
0
0
13 Dec 2024
ASL STEM Wiki: Dataset and Benchmark for Interpreting STEM Articles
Kayo Yin
Chinmay Singh
Fyodor O. Minakov
Vanessa Milan
Hal Daumé III
Cyril Zhang
Alex X. Lu
Danielle Bragg
20
2
0
08 Nov 2024
MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation
Langlin Huang
Mengyu Bu
Yang Feng
21
0
0
03 Nov 2024
Morphological Typology in BPE Subword Productivity and Language Modeling
Iñigo Parra
29
0
0
31 Oct 2024
From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes
Zébulon Goriely
Richard Diehl Martinez
Andrew Caines
Lisa Beinborn
P. Buttery
CLL
42
5
0
30 Oct 2024
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
Julie Kallini
Shikhar Murty
Christopher D. Manning
Christopher Potts
Róbert Csordás
30
2
0
28 Oct 2024
LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems
Nan Xu
Xuezhe Ma
LRM
36
3
0
18 Oct 2024
Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models
Kushal Tatariya
Vladimir Araujo
Thomas Bauwens
Miryam de Lhoneux
VLM
18
0
0
15 Oct 2024
Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5
Thao Anh Dang
Limor Raviv
Lukas Galke
23
1
0
15 Oct 2024
Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
Alex Cloud
Jacob Goldman-Wetzler
Evžen Wybitul
Joseph Miller
Alexander Matt Turner
21
2
0
06 Oct 2024
Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus
Craig Messner
Tom Lippincott
16
1
0
03 Oct 2024
Egalitarian Language Representation in Language Models: It All Begins with Tokenizers
Menan Velayuthan
Kengatharaiyer Sarveswaran
24
5
0
17 Sep 2024
DiffusionPen: Towards Controlling the Style of Handwritten Text Generation
Konstantina Nikolaidou
George Retsinas
Giorgos Sfikas
Marcus Liwicki
DiffM
33
3
0
09 Sep 2024
Predictability and Causality in Spanish and English Natural Language Generation
Andrea Busto-Castiñeira
Francisco J. González Castaño
Silvia García-Méndez
Francisco de Arriba-Pérez
CML
46
1
0
26 Aug 2024
LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP
Danlu Chen
Freda Shi
Aditi Agarwal
Jacobo Myerston
Taylor Berg-Kirkpatrick
29
2
0
08 Aug 2024
Semantics or spelling? Probing contextual word embeddings with orthographic noise
Jacob A. Matthews
John R. Starr
Marten van Schijndel
27
2
0
08 Aug 2024
Beyond Next Token Prediction: Patch-Level Training for Large Language Models
Chenze Shao
Fandong Meng
Jie Zhou
41
1
0
17 Jul 2024
MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization
Orevaoghene Ahia
Sachin Kumar
Hila Gonen
Valentin Hofmann
Tomasz Limisiewicz
Yulia Tsvetkov
Noah A. Smith
38
4
0
11 Jul 2024
CharED: Character-wise Ensemble Decoding for Large Language Models
Kevin Gu
Eva Tuecke
Dmitriy Katz
R. Horesh
David Alvarez-Melis
Mikhail Yurochkin
23
2
0
25 Jun 2024
Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation
Markus Frohmann
Igor Sterner
Ivan Vulić
Benjamin Minixhofer
Markus Schedl
VLM
41
13
0
24 Jun 2024
Zero-Shot Cross-Lingual NER Using Phonemic Representations for Low-Resource Languages
Jimin Sohn
Haeji Jung
Alex Cheng
Jooeon Kang
Yilin Du
David R. Mortensen
16
0
0
23 Jun 2024
Tokenization Falling Short: The Curse of Tokenization
Yekun Chai
Yewei Fang
Qiwei Peng
Xuhong Li
26
0
0
17 Jun 2024
Heidelberg-Boston @ SIGTYP 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers
Frederick Riemenschneider
Kevin Krahn
22
2
0
30 May 2024
SoK: Leveraging Transformers for Malware Analysis
Pradip Kunwar
Kshitiz Aryal
Maanak Gupta
Mahmoud Abdelsalam
Elisa Bertino
90
0
0
27 May 2024
Zero-Shot Tokenizer Transfer
Benjamin Minixhofer
E. Ponti
Ivan Vulić
VLM
44
9
0
13 May 2024
SpaceByte: Towards Deleting Tokenization from Large Language Modeling
Kevin Slagle
27
3
0
22 Apr 2024
EuSQuAD: Automatically Translated and Aligned SQuAD2.0 for Basque
Aitor García-Pablos
Naiara Pérez
Montse Cuadros
Jaione Bengoetxea
16
0
0
18 Apr 2024
Nostra Domina at EvaLatin 2024: Improving Latin Polarity Detection through Data Augmentation
Stephen Lawrence Bothwell
Abigail Swenor
David Chiang
22
1
0
11 Apr 2024
We're Calling an Intervention: Exploring Fundamental Hurdles in Adapting Language Models to Nonstandard Text
Aarohi Srivastava
David Chiang
54
0
0
10 Apr 2024
On the Effect of (Near) Duplicate Subwords in Language Modelling
Anton Schäfer
Thomas Hofmann
Imanol Schlag
Tiago Pimentel
34
1
0
09 Apr 2024
Training LLMs over Neurally Compressed Text
Brian Lester
Jaehoon Lee
A. Alemi
Jeffrey Pennington
Adam Roberts
Jascha Narain Sohl-Dickstein
Noah Constant
32
6
0
04 Apr 2024
An Analysis of BPE Vocabulary Trimming in Neural Machine Translation
Marco Cognetta
Tatsuya Hiraoka
Naoaki Okazaki
Rico Sennrich
Yuval Pinter
29
2
0
30 Mar 2024
MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
Tomasz Limisiewicz
Terra Blevins
Hila Gonen
Orevaoghene Ahia
Luke Zettlemoyer
30
12
0
15 Mar 2024
Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
Omer Goldman
Avi Caciularu
Matan Eyal
Kris Cao
Idan Szpektor
Reut Tsarfaty
43
22
0
10 Mar 2024
Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ
Carolin Holtermann
Paul Röttger
Timm Dill
Anne Lauscher
ELM
LRM
29
22
0
06 Mar 2024
Efficiently Leveraging Linguistic Priors for Scene Text Spotting
Nguyen Nguyen
Yapeng Tian
Chenliang Xu
45
1
0
27 Feb 2024
Mitigating the Linguistic Gap with Phonemic Representations for Robust Multilingual Language Understanding
Haeji Jung
Changdae Oh
Jooeon Kang
Jimin Sohn
Kyungwoo Song
Jinkyu Kim
David R. Mortensen
22
0
0
22 Feb 2024
Knowledge of Pretrained Language Models on Surface Information of Tokens
Tatsuya Hiraoka
Naoaki Okazaki
19
1
0
15 Feb 2024
Pixel Sentence Representation Learning
Chenghao Xiao
Zhuoxu Huang
Danlu Chen
G. Hudson
Yizhi Li
Haoran Duan
Chenghua Lin
Jie Fu
Jungong Han
Noura Al Moubayed
SSL
4
2
0
13 Feb 2024
Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect
Jannis Vamvas
Noëmi Aepli
Rico Sennrich
27
0
0
25 Jan 2024
MambaByte: Token-free Selective State Space Model
Junxiong Wang
Tushaar Gangavarapu
Jing Nathan Yan
Alexander M. Rush
Mamba
25
34
0
24 Jan 2024
Anisotropy Is Inherent to Self-Attention in Transformers
Nathan Godey
Eric Villemonte de la Clergerie
Benoît Sagot
13
16
0
22 Jan 2024
Phishing Website Detection through Multi-Model Analysis of HTML Content
Furkan Çolhak
Mert İlhan Ecevit
Bilal Emir Uçar
Reiner Creutzburg
Hasan Dag
16
7
0
09 Jan 2024