SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

19 August 2018

Taku Kudo

John Richardson

Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

50 / 2,064 papers shown

Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models

140

03 Dec 2025

Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration

Kanchon Gharami

Quazi Sarwar Muhtaseem

Deepti Gupta

Lavanya Elluri

Shafika Showkat Moni

134

27 Nov 2025

Visualizing LLM Latent Space Geometry Through Dimensionality Reduction

Alex Ning

Vainateya Rangaraju

Yen-Ling Kuo

147

26 Nov 2025

Length-MAX Tokenizer for Language Models

Dong Dong

Weijie Su

VLM

199

25 Nov 2025

MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging

17 Nov 2025

A Remarkably Efficient Paradigm to Multimodal Large Language Models for Sequential Recommendation

226

08 Nov 2025

LoPT: Lossless Parallel Tokenization Acceleration for Long Context Inference of Large Language Model

111

07 Nov 2025

Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE

Firoj Ahmmed Patwary

Abdullah Al Noman

07 Nov 2025

UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8

05 Nov 2025

Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance

Saumitra Yadav

Manish Shrivastava

161

05 Nov 2025

Open Source State-Of-the-Art Solution for Romanian Speech Recognition

Gabriel Pirlogeanu

Alexandru-Lucian Georgescu

Horia Cucu

05 Nov 2025

IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs

117

05 Nov 2025

Confounding Factors in Relating Model Performance to Morphology

Wessel Poelman

Thomas Bauwens

Miryam de Lhoneux

104

03 Nov 2025

Fast, memory-efficient genomic interval tokenizers for modern machine learning

Nathan J. LeRoy

Donald R. Campbell Jr

Seth Stadick

Oleksandr Khoroshevskyi

Sang-Hoon Park

Ziyang Hu

Nathan C. Sheffield

157

03 Nov 2025

Languages are Modalities: Cross-Lingual Alignment via Encoder Injection

Rajan Agarwal

Aarush Gupta

133

31 Oct 2025

Modular Linear Tokenization (MLT)

Tcharlies Schmitz

29 Oct 2025

Pretraining Strategies using Monolingual and Parallel Data for Low-Resource Machine Translation

Idriss Nguepi Nguefack

Mara Finkelstein

Toadoum Sari Sakayo

115

29 Oct 2025

Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish

Tegawende F. Bissyande

Jacques Klein

ELM

261

28 Oct 2025

How Muon's Spectral Design Benefits Generalization: A Study on Imbalanced Data

Christos Thrampoulidis

252

27 Oct 2025

M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR

109

25 Oct 2025

Pctx: Tokenizing Personalized Context for Generative Recommendation

127

24 Oct 2025

ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality

121

24 Oct 2025

Explaining and Mitigating Crosslingual Tokenizer Inequities

163

24 Oct 2025

Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges

...

116

22 Oct 2025

Data-Centric Lessons To Improve Speech-Language Pretraining

140

22 Oct 2025

Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation

160

21 Oct 2025

See the Text: From Tokenization to Visual Reading

159

21 Oct 2025

Accelerating Vision Transformers with Adaptive Patch Sizes

123

20 Oct 2025

Zero-Shot Performance Prediction for Probabilistic Scaling Laws

132

19 Oct 2025

TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

163

17 Oct 2025

Selecting and Combining Large Language Models for Scalable Code Clone Detection

156

17 Oct 2025

Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs

132

16 Oct 2025

Grounding Long-Context Reasoning with Contextual Normalization for Retrieval-Augmented Generation

196

15 Oct 2025

Tahakom LLM Guidelines and Recipes: From Pre-training Data to an Arabic LLM

...

198

15 Oct 2025

VaultGemma: A Differentially Private Gemma Model

Christopher A. Choquette-Choo

...

291

15 Oct 2025

End-to-End Multi-Modal Diffusion Mamba

134

15 Oct 2025

Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency

Hailay Teklehaymanot

Wolfgang Nejdl

100

14 Oct 2025

Harnessing Consistency for Robust Test-Time LLM Ensemble

131

12 Oct 2025

MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction

116

11 Oct 2025

DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

126

11 Oct 2025

Serialized EHR make for good text representations

Zhirong Chou

Quan Qin

Shi Li

11 Oct 2025

Hierarchical Scheduling for Multi-Vector Image Retrieval

122

10 Oct 2025

SkipSR: Faster Super Resolution with Token Skipping

223

09 Oct 2025

Lossless Vocabulary Reduction for Auto-Regressive Language Models

104

09 Oct 2025

Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications

IEEE Access, 2025

263

08 Oct 2025

Latent Speech-Text Transformer

...

128

07 Oct 2025

Towards Data-Efficient Medical Imaging: A Generative and Semi-Supervised Framework

252

07 Oct 2025

Large Language Models Hallucination: A Comprehensive Survey

Aisha Alansari

Hamzah Luqman

HILM LRM

461

05 Oct 2025

Multi Language Models for On-the-Fly Syntax Highlighting

Marco Edoardo Palma

Pooja Rani

Harald C. Gall

116

05 Oct 2025

Evaluating Embedding Frameworks for Scientific Domain

Nouman Ahmed

R. Wu

Victor Botev

146

03 Oct 2025