BPE-Dropout: Simple and Effective Subword Regularization


29 October 2019
Ivan Provilkov
Dmitrii Emelianenko
Elena Voita

Papers citing "BPE-Dropout: Simple and Effective Subword Regularization"

50 / 147 papers shown
Title
SEA-LION: Southeast Asian Languages in One Network
Raymond Ng
Thanh Ngan Nguyen
Yuli Huang
Ngee Chia Tai
Wai Yi Leong
...
David Ong Tat-Wee
B. Liu
William-Chandra Tjhi
Erik Cambria
Leslie Teo
36
11
0
08 Apr 2025
Retrieval-Augmented Purifier for Robust LLM-Empowered Recommendation
Liangbo Ning
Wenqi Fan
Qing Li
AAML
36
1
0
03 Apr 2025
Tokenization of Gaze Data
Tim Rolff
Jurik Karimian
Niklas Hypki
S. Schmidt
Markus Lappe
Frank Steinicke
38
0
0
28 Mar 2025
SuperBPE: Space Travel for Language Models
Alisa Liu
J. Hayase
Valentin Hofmann
Sewoong Oh
Noah A. Smith
Yejin Choi
43
3
0
17 Mar 2025
Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation
Wenhui Zhang
Huiyu Xu
Zhibo Wang
Zeqing He
Ziqi Zhu
Kui Ren
AAML
PILM
72
0
0
09 Mar 2025
Deterministic Reversible Data Augmentation for Neural Machine Translation
Jiashu Yao
Heyan Huang
Zeming Liu
Yuhang Guo
51
0
0
21 Feb 2025
MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies
Ehsaneddin Asgari
Yassine El Kheir
Mohammad Ali Sadraei Javaheri
58
0
0
02 Feb 2025
Number Cookbook: Number Understanding of Language Models and How to Improve It
Haotong Yang
Yi Hu
Shijia Kang
Zhouchen Lin
Muhan Zhang
LRM
46
2
0
06 Nov 2024
Zipfian Whitening
Sho Yokoi
Han Bao
Hiroto Kurita
Hidetoshi Shimodaira
32
0
0
01 Nov 2024
Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles
Buu Phan
Brandon Amos
Itai Gat
Marton Havasi
Matthew Muckley
Karen Ullrich
47
1
0
11 Oct 2024
From Tokens to Words: On the Inner Lexicon of LLMs
Guy Kaplan
Matanel Oren
Yuval Reif
Roy Schwartz
48
12
0
08 Oct 2024
Gold Panning in Vocabulary: An Adaptive Method for Vocabulary Expansion of Domain-Specific LLMs
Chengyuan Liu
Shihang Wang
Lizhi Qing
Kun Kuang
Yangyang Kang
Changlong Sun
Fei Wu
31
0
0
02 Oct 2024
SubRegWeigh: Effective and Efficient Annotation Weighing with Subword Regularization
Kohei Tsuji
Tatsuya Hiraoka
Yuchang Cheng
Tomoya Iwakura
40
1
0
10 Sep 2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
Pavel Chizhov
Catherine Arnett
Elizaveta Korotkova
Ivan P. Yamshchikov
42
2
0
06 Sep 2024
Distributional Properties of Subword Regularization
Marco Cognetta
Vilém Zouhar
Naoaki Okazaki
35
0
0
21 Aug 2024
Where is the signal in tokenization space?
Renato Lui Geh
Honghua Zhang
Kareem Ahmed
Benjie Wang
Guy Van den Broeck
25
4
0
16 Aug 2024
Semantics or spelling? Probing contextual word embeddings with orthographic noise
Jacob A. Matthews
John R. Starr
Marten van Schijndel
37
2
0
08 Aug 2024
Improving Self Consistency in LLMs through Probabilistic Tokenization
Ashutosh Sathe
Divyanshu Aggarwal
Sunayana Sitaram
37
4
0
04 Jul 2024
Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks
Yue Zhou
Henry Peng Zou
Barbara Maria Di Eugenio
Yang Zhang
HILM
LRM
52
1
0
01 Jul 2024
SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
Caishuang Huang
Wanxu Zhao
Rui Zheng
Huijie Lv
Shihan Dou
...
Junjie Ye
Yuming Yang
Tao Gui
Qi Zhang
Xuanjing Huang
LLMSV
AAML
47
7
0
26 Jun 2024
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
Haibo Jin
Leyang Hu
Xinuo Li
Peiyan Zhang
Chonghan Chen
Jun Zhuang
Haohan Wang
PILM
36
26
0
26 Jun 2024
Understanding and Mitigating Tokenization Bias in Language Models
Buu Phan
Marton Havasi
Matthew Muckley
Karen Ullrich
44
3
0
24 Jun 2024
Tokenization Falling Short: The Curse of Tokenization
Yekun Chai
Yewei Fang
Qiwei Peng
Xuhong Li
46
1
0
17 Jun 2024
Threat Modelling and Risk Analysis for Large Language Model (LLM)-Powered Applications
Stephen Burabari Tete
34
7
0
16 Jun 2024
UniBridge: A Unified Approach to Cross-Lingual Transfer Learning for Low-Resource Languages
Trinh Pham
Khoi M. Le
Luu Anh Tuan
42
1
0
14 Jun 2024
Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter
A. Andrusenko
A. Laptev
Vladimir Bataev
Vitaly Lavrukhin
Boris Ginsburg
35
0
0
11 Jun 2024
Lessons from the Trenches on Reproducible Evaluation of Language Models
Stella Biderman
Hailey Schoelkopf
Lintang Sutawika
Leo Gao
J. Tow
...
Xiangru Tang
Kevin A. Wang
Genta Indra Winata
François Yvon
Andy Zou
ELM
ALM
138
53
3
23 May 2024
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
Sander Land
Max Bartolo
28
21
0
08 May 2024
Modeling Orthographic Variation in Occitan's Dialects
Zachary Hopton
Noëmi Aepli
27
2
0
30 Apr 2024
Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal
Haoran Lian
Yizhe Xiong
Jianwei Niu
Shasha Mo
Zhenpeng Su
Zijia Lin
Peng Liu
Hui Chen
Guiguang Ding
34
1
0
27 Apr 2024
Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study
Wan-Hua Her
Udo Kruschwitz
37
4
0
12 Apr 2024
On the Effect of (Near) Duplicate Subwords in Language Modelling
Anton Schäfer
Thomas Hofmann
Imanol Schlag
Tiago Pimentel
36
1
0
09 Apr 2024
Sailor: Open Language Models for South-East Asia
Longxu Dou
Qian Liu
Guangtao Zeng
Jia Guo
Jiahui Zhou
Wei Lu
Min-Bin Lin
LRM
32
7
0
04 Apr 2024
Advancing AI with Integrity: Ethical Challenges and Solutions in Neural Machine Translation
Richard Kimera
Yun-Seon Kim
Heeyoul Choi
21
1
0
01 Apr 2024
An Analysis of BPE Vocabulary Trimming in Neural Machine Translation
Marco Cognetta
Tatsuya Hiraoka
Naoaki Okazaki
Rico Sennrich
Yuval Pinter
29
2
0
30 Mar 2024
Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
Omer Goldman
Avi Caciularu
Matan Eyal
Kris Cao
Idan Szpektor
Reut Tsarfaty
45
22
0
10 Mar 2024
Greed is All You Need: An Evaluation of Tokenizer Inference Methods
Omri Uzan
Craig W. Schmidt
Chris Tanner
Yuval Pinter
38
14
0
02 Mar 2024
Tokenization Is More Than Compression
Craig W. Schmidt
Varshini Reddy
Haoran Zhang
Alec Alameddine
Omri Uzan
Yuval Pinter
Chris Tanner
40
28
0
28 Feb 2024
Two Counterexamples to Tokenization and the Noiseless Channel
Marco Cognetta
Vilém Zouhar
Sangwhan Moon
Naoaki Okazaki
27
0
0
22 Feb 2024
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Fengqing Jiang
Zhangchen Xu
Luyao Niu
Zhen Xiang
Bhaskar Ramasubramanian
Bo Li
Radha Poovendran
41
86
0
19 Feb 2024
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
Zhangchen Xu
Fengqing Jiang
Luyao Niu
Jinyuan Jia
Bill Yuchen Lin
Radha Poovendran
AAML
131
85
0
14 Feb 2024
Getting the most out of your tokenizer for pre-training and domain adaptation
Gautier Dagan
Gabriele Synnaeve
Baptiste Rozière
34
20
0
01 Feb 2024
Importance-Aware Data Augmentation for Document-Level Neural Machine Translation
Ming-Ru Wu
Yufei Wang
George F. Foster
Lizhen Qu
Gholamreza Haffari
35
6
0
27 Jan 2024
Enhancing Personality Recognition in Dialogue by Data Augmentation and Heterogeneous Conversational Graph Networks
Yahui Fu
Haiyue Song
Tianyu Zhao
Tatsuya Kawahara
37
1
0
11 Jan 2024
Heterogeneous Encoders Scaling In The Transformer For Neural Machine Translation
J. Hu
Roberto Cavicchioli
Giulia Berardinelli
Alessandro Capotondi
38
2
0
26 Dec 2023
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information
Zhengmian Hu
Gang Wu
Saayan Mitra
Ruiyi Zhang
Tong Sun
Heng-Chiao Huang
Vishy Swaminathan
24
23
0
20 Nov 2023
Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer
Jin Qiu
Lu Huang
Boyu Li
Jun Zhang
Lu Lu
Zejun Ma
21
3
0
15 Nov 2023
Machine Translation for Nko: Tools, Corpora and Baseline Results
M. Doumbouya
Baba Mamadi Diané
Solo Farabado Cissé
Djibrila Diané
Abdoulaye Sow
...
Fodé Moriba Bayo
Ibrahima Sory 2. Condé
Kalo Mory Diané
Chris Piech
Christopher D. Manning
13
3
0
24 Oct 2023
Formalizing and Benchmarking Prompt Injection Attacks and Defenses
Yupei Liu
Yuqi Jia
Runpeng Geng
Jinyuan Jia
Neil Zhenqiang Gong
SILM
LLMAG
18
62
0
19 Oct 2023
Learn Your Tokens: Word-Pooled Tokenization for Language Modeling
Avijit Thawani
Saurabh Ghanekar
Xiaoyuan Zhu
Jay Pujara
32
4
0
17 Oct 2023