CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Transactions of the Association for Computational Linguistics (TACL), 2021

11 March 2021

Papers citing "CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation"

50 / 166 papers shown

MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization Orevaoghene Ahia Sachin Kumar Hila Gonen Valentin Hoffman Tomasz Limisiewicz Yulia Tsvetkov Noah A. Smith 200 15 0 11 Jul 2024
CharED: Character-wise Ensemble Decoding for Large Language Models Kevin Gu Eva Tuecke Dmitriy Katz R. Horesh David Alvarez-Melis Mikhail Yurochkin 140 3 0 25 Jun 2024
Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation Markus Frohmann Igor Sterner Ivan Vulić Benjamin Minixhofer Markus Schedl VLM 169 29 0 24 Jun 2024
Zero-Shot Cross-Lingual NER Using Phonemic Representations for Low-Resource Languages Jimin Sohn Haeji Jung Alex Cheng Jooeon Kang Yilin Du David R. Mortensen 111 1 0 23 Jun 2024
Tokenization Falling Short: The Curse of Tokenization Yekun Chai Yewei Fang Qiwei Peng Xuhong Li 151 12 0 17 Jun 2024
Heidelberg-Boston @ SIGTYP 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers Frederick Riemenschneider Kevin Krahn 115 3 0 30 May 2024
SoK: Leveraging Transformers for Malware Analysis Pradip Kunwar Kshitiz Aryal Maanak Gupta Mahmoud Abdelsalam Elisa Bertino 282 2 0 27 May 2024
Zero-Shot Tokenizer Transfer Neural Information Processing Systems (NeurIPS), 2024 Benjamin Minixhofer Edoardo Ponti Ivan Vulić VLM 131 23 0 13 May 2024
SpaceByte: Towards Deleting Tokenization from Large Language Modeling Kevin Slagle 164 10 0 22 Apr 2024
EuSQuAD: Automatically Translated and Aligned SQuAD2.0 for Basque Aitor García-Pablos Naiara Pérez Montse Cuadros Jaione Bengoetxea 141 0 0 18 Apr 2024
Nostra Domina at EvaLatin 2024: Improving Latin Polarity Detection through Data Augmentation Stephen Lawrence Bothwell Abigail Swenor David Chiang 107 1 0 11 Apr 2024
We're Calling an Intervention: Exploring Fundamental Hurdles in Adapting Language Models to Nonstandard Text Aarohi Srivastava David Chiang 218 3 0 10 Apr 2024
On the Effect of (Near) Duplicate Subwords in Language Modelling Annual Meeting of the Association for Computational Linguistics (ACL), 2024 Anton Schäfer Thomas Hofmann Imanol Schlag Tiago Pimentel 170 3 0 09 Apr 2024
Training LLMs over Neurally Compressed Text Brian Lester Jaehoon Lee A. Alemi Jeffrey Pennington Adam Roberts Jascha Narain Sohl-Dickstein Noah Constant 151 9 0 04 Apr 2024
An Analysis of BPE Vocabulary Trimming in Neural Machine Translation Marco Cognetta Tatsuya Hiraoka Naoaki Okazaki Rico Sennrich Yuval Pinter 184 2 0 30 Mar 2024
MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling Tomasz Limisiewicz Terra Blevins Hila Gonen Orevaoghene Ahia Luke Zettlemoyer 152 25 0 15 Mar 2024
Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance Annual Meeting of the Association for Computational Linguistics (ACL), 2024 Omer Goldman Avi Caciularu Matan Eyal Kris Cao Idan Szpektor Reut Tsarfaty 181 39 0 10 Mar 2024
Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ Carolin Holtermann Paul Röttger Timm Dill Anne Lauscher ELM LRM 175 32 0 06 Mar 2024
Efficiently Leveraging Linguistic Priors for Scene Text Spotting Nguyen Nguyen Yapeng Tian Chenliang Xu 153 2 0 27 Feb 2024
Mitigating the Linguistic Gap with Phonemic Representations for Robust Multilingual Language Understanding Haeji Jung Changdae Oh Jooeon Kang Jimin Sohn Kyungwoo Song Jinkyu Kim David R. Mortensen 122 2 0 22 Feb 2024
Knowledge of Pretrained Language Models on Surface Information of Tokens Tatsuya Hiraoka Naoaki Okazaki 142 4 0 15 Feb 2024
Pixel Sentence Representation Learning Chenghao Xiao Zhuoxu Huang Danlu Chen G. Hudson Yi Zhou Haoran Duan Chenghua Lin Jie Fu Jungong Han Noura Al Moubayed SSL 105 5 0 13 Feb 2024
Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect Jannis Vamvas Noëmi Aepli Rico Sennrich 172 1 0 25 Jan 2024
MambaByte: Token-free Selective State Space Model Junxiong Wang Tushaar Gangavarapu Jing Nathan Yan Alexander M. Rush Mamba 180 50 0 24 Jan 2024
Anisotropy Is Inherent to Self-Attention in Transformers Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2024 Nathan Godey Eric Villemonte de la Clergerie Benoît Sagot 152 29 0 22 Jan 2024
Phishing Website Detection through Multi-Model Analysis of HTML Content Furkan Çolhak Mert İlhan Ecevit Bilal Emir Uçar Reiner Creutzburg Hasan Dag 126 10 0 09 Jan 2024
SecureReg: Combining NLP and MLP for Enhanced Detection of Malicious Domain Name Registrations Furkan Çolhak Mert İlhan Ecevit Hasan Dağ Reiner Creutzburg 92 0 0 06 Jan 2024
Learning Mutually Informed Representations for Characters and Subwords Yilin Wang Xinyi Hu Matthew R. Gormley 133 0 0 14 Nov 2023
Explicit Morphological Knowledge Improves Pre-training of Language Models for Hebrew Eylon Gueta Omer Goldman Reut Tsarfaty 87 3 0 01 Nov 2023
Text Rendering Strategies for Pixel Language Models Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023 Jonas F. Lotz Elizabeth Salesky Phillip Rust Desmond Elliott VLM 171 14 0 01 Nov 2023
Learning to Abstract with Nonparametric Variational Information Bottleneck Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023 Melika Behjati Fabio Fehr James Henderson SSL 125 4 0 26 Oct 2023
Analyzing Cognitive Plausibility of Subword Tokenization Lisa Beinborn Yuval Pinter 133 27 0 20 Oct 2023
Learn Your Tokens: Word-Pooled Tokenization for Language Modeling Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023 Avijit Thawani Saurabh Ghanekar Xiaoyuan Zhu Jay Pujara 186 9 0 17 Oct 2023
Optimized Tokenization for Transcribed Error Correction Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023 Tomer Wullach Shlomo E. Chazan 132 0 0 16 Oct 2023
Tokenizer Choice For LLM Training: Negligible or Crucial? Mehdi Ali Michael Fromm Klaudia Thellmann Richard Rutmann Max Lübbering ... Malte Ostendorff Samuel Weinbach R. Sifa Stefan Kesselheim Nicolas Flores-Herr 233 85 0 12 Oct 2023
To token or not to token: A Comparative Study of Text Representations for Cross-Lingual Transfer Md. Mushfiqur Rahman Fardin Ahsan Sakib Fahim Faisal Antonios Anastasopoulos 120 4 0 12 Oct 2023
Pit One Against Many: Leveraging Attention-head Embeddings for Parameter-efficient Multi-head Attention Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023 Huiyin Xue Nikolaos Aletras 173 1 0 11 Oct 2023
Syllable-level lyrics generation from melody exploiting character-level language model Findings (Findings), 2023 Zhe Zhang Karol Lasocki Yi Yu Atsuhiro Takasu 128 6 0 02 Oct 2023
Assessment of Pre-Trained Models Across Languages and Grammars International Joint Conference on Natural Language Processing (IJCNLP), 2023 Alberto Muñoz-Ortiz David Vilares Carlos Gómez-Rodríguez 135 4 0 20 Sep 2023
A multimodal deep learning architecture for smoking detection with a small data approach medRxiv (medRxiv), 2023 Róbert Lakatos P. Pollner András Hajdu Tamas Joo 70 11 0 19 Sep 2023
Multilingual Text Representation Fahim Faisal 129 0 0 02 Sep 2023
Lightweight Adaptation of Neural Language Models via Subspace Embedding International Conference on Information and Knowledge Management (CIKM), 2023 Amit Kumar Jaiswal Haiming Liu 116 2 0 16 Aug 2023
Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT Jing Yang Cong Liu Wendy Deng Dangwei Wu Chunhua Weng Yunyun Zhou Kai Wang 101 29 0 11 Aug 2023
CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code International Conference on Learning Representations (ICLR), 2023 Nadezhda Chirkova Sergey Troshin 169 9 0 01 Aug 2023
Biomedical Language Models are Robust to Sub-optimal Tokenization Workshop on Biomedical Natural Language Processing (BioNLP), 2023 Bernal Jiménez Gutiérrez Huan Sun Yu-Chuan Su 103 8 0 30 Jun 2023
Is Anisotropy Inherent to Transformers? Nathan Godey Eric Villemonte de la Clergerie Benoît Sagot 130 4 0 13 Jun 2023
When Vision Fails: Text Attacks Against ViT and OCR Nicholas Boucher Jenny Blessing Ilia Shumailov Ross J. Anderson Nicolas Papernot AAML 115 4 0 12 Jun 2023
Hierarchical Attention Encoder Decoder Asier Mujika BDL 137 4 0 01 Jun 2023
Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation Annual Meeting of the Association for Computational Linguistics (ACL), 2023 Benjamin Minixhofer Jonas Pfeiffer Ivan Vulić 141 22 0 30 May 2023
Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora Annual Meeting of the Association for Computational Linguistics (ACL), 2023 Svanhvít Lilja Ingólfsdóttir Pétur Orri Ragnarsson H. Jónsson Haukur Barri Símonarson Vilhjálmur Þorsteinsson Vésteinn Snæbjarnarson SyDa 130 11 0 29 May 2023