BPE-Dropout: Simple and Effective Subword Regularization

29 October 2019

Papers citing "BPE-Dropout: Simple and Effective Subword Regularization"

50 / 147 papers shown

SEA-LION: Southeast Asian Languages in One Network Raymond Ng Thanh Ngan Nguyen Yuli Huang Ngee Chia Tai Wai Yi Leong ... David Ong Tat-Wee B. Liu William-Chandra Tjhi Erik Cambria Leslie Teo 36 11 0 08 Apr 2025
Retrieval-Augmented Purifier for Robust LLM-Empowered Recommendation Liangbo Ning Wenqi Fan Qing Li AAML 36 1 0 03 Apr 2025
Tokenization of Gaze Data Tim Rolff Jurik Karimian Niklas Hypki S. Schmidt Markus Lappe Frank Steinicke 38 0 0 28 Mar 2025
SuperBPE: Space Travel for Language Models Alisa Liu J. Hayase Valentin Hofmann Sewoong Oh Noah A. Smith Yejin Choi 43 3 0 17 Mar 2025
Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation Wenhui Zhang Huiyu Xu Zhibo Wang Zeqing He Ziqi Zhu Kui Ren AAML PILM 72 0 0 09 Mar 2025
Deterministic Reversible Data Augmentation for Neural Machine Translation Jiashu Yao Heyan Huang Zeming Liu Yuhang Guo 51 0 0 21 Feb 2025
MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies Ehsaneddin Asgari Yassine El Kheir Mohammad Ali Sadraei Javaheri 58 0 0 02 Feb 2025
Number Cookbook: Number Understanding of Language Models and How to Improve It Haotong Yang Yi Hu Shijia Kang Zhouchen Lin Muhan Zhang LRM 46 2 0 06 Nov 2024
Zipfian Whitening Sho Yokoi Han Bao Hiroto Kurita Hidetoshi Shimodaira 32 0 0 01 Nov 2024
Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles Buu Phan Brandon Amos Itai Gat Marton Havasi Matthew Muckley Karen Ullrich 47 1 0 11 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs Guy Kaplan Matanel Oren Yuval Reif Roy Schwartz 48 12 0 08 Oct 2024
Gold Panning in Vocabulary: An Adaptive Method for Vocabulary Expansion of Domain-Specific LLMs Chengyuan Liu Shihang Wang Lizhi Qing Kun Kuang Yangyang Kang Changlong Sun Fei Wu 31 0 0 02 Oct 2024
SubRegWeigh: Effective and Efficient Annotation Weighing with Subword Regularization Kohei Tsuji Tatsuya Hiraoka Yuchang Cheng Tomoya Iwakura 40 1 0 10 Sep 2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training Pavel Chizhov Catherine Arnett Elizaveta Korotkova Ivan P. Yamshchikov 42 2 0 06 Sep 2024
Distributional Properties of Subword Regularization Marco Cognetta Vilém Zouhar Naoaki Okazaki 35 0 0 21 Aug 2024
Where is the signal in tokenization space? Renato Lui Geh Honghua Zhang Kareem Ahmed Benjie Wang Guy Van den Broeck 25 4 0 16 Aug 2024
Semantics or spelling? Probing contextual word embeddings with orthographic noise Jacob A. Matthews John R. Starr Marten van Schijndel 37 2 0 08 Aug 2024
Improving Self Consistency in LLMs through Probabilistic Tokenization Ashutosh Sathe Divyanshu Aggarwal Sunayana Sitaram 37 4 0 04 Jul 2024
Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks Yue Zhou Henry Peng Zou Barbara Maria Di Eugenio Yang Zhang HILM LRM 52 1 0 01 Jul 2024
SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance Caishuang Huang Wanxu Zhao Rui Zheng Huijie Lv Shihan Dou ... Junjie Ye Yuming Yang Tao Gui Qi Zhang Xuanjing Huang LLMSV AAML 47 7 0 26 Jun 2024
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models Haibo Jin Leyang Hu Xinuo Li Peiyan Zhang Chonghan Chen Jun Zhuang Haohan Wang PILM 36 26 0 26 Jun 2024
Understanding and Mitigating Tokenization Bias in Language Models Buu Phan Marton Havasi Matthew Muckley Karen Ullrich 44 3 0 24 Jun 2024
Tokenization Falling Short: The Curse of Tokenization Yekun Chai Yewei Fang Qiwei Peng Xuhong Li 46 1 0 17 Jun 2024
Threat Modelling and Risk Analysis for Large Language Model (LLM)-Powered Applications Stephen Burabari Tete 34 7 0 16 Jun 2024
UniBridge: A Unified Approach to Cross-Lingual Transfer Learning for Low-Resource Languages Trinh Pham Khoi M. Le Luu Anh Tuan 42 1 0 14 Jun 2024
Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter A. Andrusenko A. Laptev Vladimir Bataev Vitaly Lavrukhin Boris Ginsburg 35 0 0 11 Jun 2024
Lessons from the Trenches on Reproducible Evaluation of Language Models Stella Biderman Hailey Schoelkopf Lintang Sutawika Leo Gao J. Tow ... Xiangru Tang Kevin A. Wang Genta Indra Winata François Yvon Andy Zou ELM ALM 138 53 3 23 May 2024
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models Sander Land Max Bartolo 28 21 0 08 May 2024
Modeling Orthographic Variation in Occitan's Dialects Zachary Hopton Noemi Aepli 27 2 0 30 Apr 2024
Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal Haoran Lian Yizhe Xiong Jianwei Niu Shasha Mo Zhenpeng Su Zijia Lin Peng Liu Hui Chen Guiguang Ding 34 1 0 27 Apr 2024
Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study Wan-Hua Her Udo Kruschwitz 37 4 0 12 Apr 2024
On the Effect of (Near) Duplicate Subwords in Language Modelling Anton Schäfer Thomas Hofmann Imanol Schlag Tiago Pimentel 36 1 0 09 Apr 2024
Sailor: Open Language Models for South-East Asia Longxu Dou Qian Liu Guangtao Zeng Jia Guo Jiahui Zhou Wei Lu Min-Bin Lin LRM 32 7 0 04 Apr 2024
Advancing AI with Integrity: Ethical Challenges and Solutions in Neural Machine Translation Richard Kimera Yun-Seon Kim Heeyoul Choi 21 1 0 01 Apr 2024
An Analysis of BPE Vocabulary Trimming in Neural Machine Translation Marco Cognetta Tatsuya Hiraoka Naoaki Okazaki Rico Sennrich Yuval Pinter 29 2 0 30 Mar 2024
Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance Omer Goldman Avi Caciularu Matan Eyal Kris Cao Idan Szpektor Reut Tsarfaty 45 22 0 10 Mar 2024
Greed is All You Need: An Evaluation of Tokenizer Inference Methods Omri Uzan Craig W. Schmidt Chris Tanner Yuval Pinter 38 14 0 02 Mar 2024
Tokenization Is More Than Compression Craig W. Schmidt Varshini Reddy Haoran Zhang Alec Alameddine Omri Uzan Yuval Pinter Chris Tanner 40 28 0 28 Feb 2024
Two Counterexamples to Tokenization and the Noiseless Channel Marco Cognetta Vilém Zouhar Sangwhan Moon Naoaki Okazaki 27 0 0 22 Feb 2024
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs Fengqing Jiang Zhangchen Xu Luyao Niu Zhen Xiang Bhaskar Ramasubramanian Bo Li Radha Poovendran 41 86 0 19 Feb 2024
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding Zhangchen Xu Fengqing Jiang Luyao Niu Jinyuan Jia Bill Yuchen Lin Radha Poovendran AAML 131 85 0 14 Feb 2024
Getting the most out of your tokenizer for pre-training and domain adaptation Gautier Dagan Gabriele Synnaeve Baptiste Rozière 34 20 0 01 Feb 2024
Importance-Aware Data Augmentation for Document-Level Neural Machine Translation Ming-Ru Wu Yufei Wang George F. Foster Lizhen Qu Gholamreza Haffari 35 6 0 27 Jan 2024
Enhancing Personality Recognition in Dialogue by Data Augmentation and Heterogeneous Conversational Graph Networks Yahui Fu Haiyue Song Tianyu Zhao Tatsuya Kawahara 37 1 0 11 Jan 2024
Heterogeneous Encoders Scaling In The Transformer For Neural Machine Translation J. Hu Roberto Cavicchioli Giulia Berardinelli Alessandro Capotondi 38 2 0 26 Dec 2023
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information Zhengmian Hu Gang Wu Saayan Mitra Ruiyi Zhang Tong Sun Heng-Chiao Huang Vishy Swaminathan 24 23 0 20 Nov 2023
Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer Jin Qiu Lu Huang Boyu Li Jun Zhang Lu Lu Zejun Ma 21 3 0 15 Nov 2023
Machine Translation for Nko: Tools, Corpora and Baseline Results M. Doumbouya Baba Mamadi Diané Solo Farabado Cissé Djibrila Diané Abdoulaye Sow ... Fodé Moriba Bayo Ibrahima Sory 2. Condé Kalo Mory Diané Chris Piech Christopher D. Manning 13 3 0 24 Oct 2023
Formalizing and Benchmarking Prompt Injection Attacks and Defenses Yupei Liu Yuqi Jia Runpeng Geng Jinyuan Jia Neil Zhenqiang Gong SILM LLMAG 18 62 0 19 Oct 2023
Learn Your Tokens: Word-Pooled Tokenization for Language Modeling Avijit Thawani Saurabh Ghanekar Xiaoyuan Zhu Jay Pujara 32 4 0 17 Oct 2023