Incorporating Context into Subword Vocabularies

13 October 2022

Papers citing "Incorporating Context into Subword Vocabularies"

UniNet: A Unified Multi-granular Traffic Modeling Framework for Network Security Binghui Wu D. Divakaran M. Gurusamy 57 0 0 06 Mar 2025
Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods Burak Suyunu Enes Taylan Arzucan Özgür 62 1 0 26 Nov 2024
From Tokens to Words: On the Inner Lexicon of LLMs Guy Kaplan Matanel Oren Yuval Reif Roy Schwartz 39 12 0 08 Oct 2024
Infusing clinical knowledge into tokenisers for language models Abul Hasan Jinge Wu Quang Ngoc Nguyen Salomé Andres Imane Guellil Huayu Zhang Arlene Casey Beatrice Alex Bruce Guthrie Honghan Wu 25 1 0 20 Jun 2024
PatternGPT: A Pattern-Driven Framework for Large Language Model Text Generation Le Xiao Xin Shan 19 4 0 02 Jul 2023
MaxMatch-Dropout: Subword Regularization for WordPiece Tatsuya Hiraoka 27 8 0 09 Sep 2022
Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset Peter Henderson M. Krass Lucia Zheng Neel Guha Christopher D. Manning Dan Jurafsky Daniel E. Ho AILaw ELM 129 94 0 01 Jul 2022
Improving Tokenisation by Alternative Treatment of Spaces Edward Gow-Smith Harish Tayyar Madabushi Carolina Scarton Aline Villavicencio 19 20 0 08 Apr 2022
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models Phillip Rust Jonas Pfeiffer Ivan Vulić Sebastian Ruder Iryna Gurevych 69 235 0 31 Dec 2020
Improving Multilingual Models with Language-Clustered Vocabularies Hyung Won Chung Dan Garrette Kiat Chuan Tan Jason Riesa VLM 58 65 0 24 Oct 2020
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding Alex Wang Amanpreet Singh Julian Michael Felix Hill Omer Levy Samuel R. Bowman ELM 294 6,927 0 20 Apr 2018
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Yonghui Wu M. Schuster Z. Chen Quoc V. Le Mohammad Norouzi ... Alex Rudnick Oriol Vinyals G. Corrado Macduff Hughes J. Dean AIMat 716 6,724 0 26 Sep 2016