CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Transactions of the Association for Computational Linguistics (TACL), 2021
11 March 2021
Jonathan H. Clark
Dan Garrette
Iulia Turc
John Wieting

Papers citing "CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation"

Showing 17 of 167 citing papers:

 1. Integrating Approaches to Word Representation
    Yuval Pinter. 10 Sep 2021.

 2. Translate & Fill: Improving Zero-Shot Multilingual Semantic Parsing with Synthetic Data
    Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021. Massimo Nicosia, Zhongdi Qu, Yasemin Altun. 09 Sep 2021.

 3. You should evaluate your language model on marginal likelihood over tokenisations
    Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021. Kris Cao, Laura Rimell. 06 Sep 2021.

 4. How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology?
    Chantal Amrhein, Rico Sennrich. 02 Sep 2021.

 5. AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing
    Katikapalli Subramanyam Kalyan, A. Rajasekharan, S. Sangeetha. 12 Aug 2021.

 6. Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information
    Yuval Pinter, Amanda Stent, Mark Dredze, Jacob Eisenstein. 01 Aug 2021.

 7. Perceiver IO: A General Architecture for Structured Inputs & Outputs
    International Conference on Learning Representations (ICLR), 2021. Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, ..., Olivier J. Hénaff, M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira. 30 Jul 2021.

 8. Local Structure Matters Most: Perturbation Study in NLU
    Findings, 2021. Louis Clouâtre, Prasanna Parthasarathi, Payel Das, Sarath Chandar. 29 Jul 2021.

 9. Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
    Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler. 23 Jun 2021.

10. Specializing Multilingual Language Models: An Empirical Study
    Ethan C. Chau, Noah A. Smith. 16 Jun 2021.

11. Sub-Character Tokenization for Chinese Pretrained Language Models
    Transactions of the Association for Computational Linguistics (TACL), 2021. Chenglei Si, Zhengyan Zhang, Yingfa Chen, Fanchao Qi, Xiaozhi Wang, Zhiyuan Liu, Yasheng Wang, Qun Liu, Maosong Sun. 01 Jun 2021.

12. ByT5: Towards a token-free future with pre-trained byte-to-byte models
    Transactions of the Association for Computational Linguistics (TACL), 2021. Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel. 28 May 2021.

13. Robust Open-Vocabulary Translation from Visual Text Representations
    Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021. Elizabeth Salesky, David Etter, Matt Post. 16 Apr 2021.

14. XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation
    Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021. Sebastian Ruder, Noah Constant, Jan A. Botha, Aditya Siddhant, Orhan Firat, ..., Pengfei Liu, Junjie Hu, Dan Garrette, Graham Neubig, Melvin Johnson. 15 Apr 2021.

15. Inducing Meaningful Units from Character Sequences with Dynamic Capacity Slot Attention
    Melika Behjati, James Henderson. 01 Feb 2021.

16. How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
    Annual Meeting of the Association for Computational Linguistics (ACL), 2020. Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, Iryna Gurevych. 31 Dec 2020.

17. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
    Transactions of the Association for Computational Linguistics (TACL), 2020. J. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, J. Palomaki. 10 Mar 2020.