1808.06226
Cited By
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
19 August 2018
Taku Kudo
John Richardson
ArXiv (abs)
PDF
HTML
Github (10925★)
Papers citing
"SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"
50 / 2,061 papers shown
Title
Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models
Taido Purason
Pavel Chizhov
Ivan P. Yamshchikov
Mark Fishel
CLL
VLM
88
0
0
03 Dec 2025
Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration
Kanchon Gharami
Quazi Sarwar Muhtaseem
Deepti Gupta
Lavanya Elluri
Shafika Showkat Moni
93
0
0
27 Nov 2025
Visualizing LLM Latent Space Geometry Through Dimensionality Reduction
Alex Ning
Vainateya Rangaraju
132
0
0
26 Nov 2025
Length-MAX Tokenizer for Language Models
Dong Dong
Weijie Su
VLM
174
0
0
25 Nov 2025
MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging
Siyuan Li
Kai Yu
Anna Wang
Zicheng Liu
Chang Yu
Jingbo Zhou
Qirong Yang
Yucheng Guo
Xiaoming Zhang
Stan Z. Li
92
0
0
17 Nov 2025
A Remarkably Efficient Paradigm to Multimodal Large Language Models for Sequential Recommendation
Qiyong Zhong
Jiajie Su
Ming Yang
Yunshan Ma
Xiaolin Zheng
Chaochao Chen
182
0
0
08 Nov 2025
LoPT: Lossless Parallel Tokenization Acceleration for Long Context Inference of Large Language Model
Wei Shao
Lingchao Zheng
Pengyu Wang
Peizhen Zheng
Jun Li
Yuwei Fan
82
0
0
07 Nov 2025
Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE
Firoj Ahmmed Patwary
Abdullah Al Noman
68
0
0
07 Nov 2025
IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs
Souvik Rana
Arul Menezes
Ashish Kulkarni
Chandra Khatri
Shubham Agarwal
108
0
0
05 Nov 2025
Open Source State-Of-the-Art Solution for Romanian Speech Recognition
Gabriel Pirlogeanu
Alexandru-Lucian Georgescu
Horia Cucu
80
0
0
05 Nov 2025
Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance
Saumitra Yadav
Manish Shrivastava
146
0
0
05 Nov 2025
UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8
Preston Firestone
Shubham Ugare
Gagandeep Singh
Sasa Misailovic
92
1
0
05 Nov 2025
Confounding Factors in Relating Model Performance to Morphology
Wessel Poelman
Thomas Bauwens
Miryam de Lhoneux
96
2
0
03 Nov 2025
Fast, memory-efficient genomic interval tokenizers for modern machine learning
Nathan J. LeRoy
Donald R. Campbell Jr
Seth Stadick
Oleksandr Khoroshevskyi
Sang-Hoon Park
Ziyang Hu
Nathan C. Sheffield
140
1
0
03 Nov 2025
Languages are Modalities: Cross-Lingual Alignment via Encoder Injection
Rajan Agarwal
Aarush Gupta
126
0
0
31 Oct 2025
Modular Linear Tokenization (MLT)
Tcharlies Schmitz
52
0
0
29 Oct 2025
Pretraining Strategies using Monolingual and Parallel Data for Low-Resource Machine Translation
Idriss Nguepi Nguefack
Mara Finkelstein
Toadoum Sari Sakayo
81
0
0
29 Oct 2025
Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish
Lujun Li
Yewei Song
Lama Sleem
Yiqun Wang
Yangjie Xu
Cedric Lothritz
Niccolo Gentile
Radu State
Tegawende F. Bissyande
Jacques Klein
ELM
236
0
0
28 Oct 2025
How Muon's Spectral Design Benefits Generalization: A Study on Imbalanced Data
Bhavya Vasudeva
Puneesh Deora
Yize Zhao
Vatsal Sharan
Christos Thrampoulidis
192
0
0
27 Oct 2025
M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR
Ruixiang Mao
Xiangnan Ma
Qing Yang
Ziming Zhu
Yucheng Qiao
Yuan Ge
Tong Xiao
Shengxiang Gao
Zhengtao Yu
Jingbo Zhu
84
0
0
25 Oct 2025
Explaining and Mitigating Crosslingual Tokenizer Inequities
Catherine Arnett
T. Chang
Stella Biderman
Benjamin Bergen
148
0
0
24 Oct 2025
ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
Shayne Longpre
Sneha Kudugunta
Niklas Muennighoff
I-Hung Hsu
Isaac Caswell
Alex Pentland
Sercan O. Arik
Chen-Yu Lee
Sayna Ebrahimi
CLL
LRM
105
1
0
24 Oct 2025
Pctx: Tokenizing Personalized Context for Generative Recommendation
Qiyong Zhong
Jiajie Su
Yunshan Ma
Julian McAuley
Yupeng Hou
104
0
0
24 Oct 2025
Data-Centric Lessons To Improve Speech-Language Pretraining
Vishaal Udandarao
Zhiyun Lu
Xuankai Chang
Yongqiang Wang
Violet Z. Yao
Albin Madapally Jose
Fartash Faghri
Josh Gardner
Chung-Cheng Chiu
124
0
0
22 Oct 2025
Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges
Cheng Huang
Nyima Tashi
Fan Gao
Yutong Liu
J. Li
...
Guojie Tang
Xiangxiang Wang
Jia Zhang
Tsengdar J. Lee
Yongbin Yu
104
0
0
22 Oct 2025
Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation
Yasser Hamidullah
Koel Dutta Chowdhury
Yusser Al Ghussin
Shakib Yazdani
Cennet Oguz
Josef van Genabith
C. España-Bonet
137
0
0
21 Oct 2025
See the Text: From Tokenization to Visual Reading
Ling Xing
Alex Jinpeng Wang
Rui Yan
Hongyu Qu
Zechao Li
Jinhui Tang
VLM
148
0
0
21 Oct 2025
Accelerating Vision Transformers with Adaptive Patch Sizes
Rohan Choudhury
JungEun Kim
Jeongseok Lee
Eunho Yang
László A. Jeni
Kishore Venkateshan
ViT
112
1
0
20 Oct 2025
Zero-Shot Performance Prediction for Probabilistic Scaling Laws
Viktoria Schram
Markus Hiller
Daniel Beck
Trevor Cohn
128
0
0
19 Oct 2025
TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs
Sibo Xiao
Jinyuan Fu
Zhongle Xie
Lidan Shou
AI4TS
119
0
0
17 Oct 2025
Selecting and Combining Large Language Models for Scalable Code Clone Detection
Muslim Chochlov
Gul Aftab Ahmed
James Vincent Patten
Yuanhua Han
Guoxian Lu
David Gregg
J. Buckley
125
0
0
17 Oct 2025
Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs
Parsa Hejabi
Elnaz Rahmati
Alireza S. Ziabari
Morteza Dehghani
AAML
LRM
120
0
0
16 Oct 2025
Tahakom LLM Guidelines and Recipes: From Pre-training Data to an Arabic LLM
Areej AlOtaibi
Lina Alyahya
Raghad Alshabanah
Shahad Alfawzan
Shuruq Alarefei
...
Waad Alahmed
Omar Talabay
Jalal Alowibdi
Salem Alelyani
Adel Bibi
173
0
0
15 Oct 2025
Grounding Long-Context Reasoning with Contextual Normalization for Retrieval-Augmented Generation
Jiamin Chen
Yuchen Li
Xinyu Ma
X. Chen
Xiaokun Zhang
Shuaiqiang Wang
Chen Ma
D. Yin
RALM
LRM
159
0
0
15 Oct 2025
VaultGemma: A Differentially Private Gemma Model
Amer Sinha
Thomas Mesnard
Ryan McKenna
Daogao Liu
Christopher A. Choquette-Choo
...
Borja De Balle Pigem
Prem Eruvbetine
T. Warkentin
Armand Joulin
Ravi Kumar
FedML
MoE
VLM
MDE
278
2
0
15 Oct 2025
End-to-End Multi-Modal Diffusion Mamba
Chunhao Lu
Qiang Lu
Meichen Dong
Jake Luo
122
3
0
15 Oct 2025
Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency
Hailay Teklehaymanot
Wolfgang Nejdl
88
0
0
14 Oct 2025
MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction
Jianjin Wang
Runsong Zhao
Xiaoqian Liu
Yuan Ge
Ziqiang Xu
Tong Xiao
Shengxiang Gao
Z. Yu
Jingbo Zhu
92
0
0
11 Oct 2025
DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models
Jinbin Zhang
Nasib Ullah
Erik Schultheis
Rohit Babbar
116
1
0
11 Oct 2025
Serialized EHR make for good text representations
Zhirong Chou
Quan Qin
Shi Li
77
0
0
11 Oct 2025
Hierarchical Scheduling for Multi-Vector Image Retrieval
Maoliang Li
K. Li
Yaoyang Liu
Jiayu Chen
Zihao Zheng
Yinjun Wu
Xiang Chen
112
0
0
10 Oct 2025
SkipSR: Faster Super Resolution with Token Skipping
Rohan Choudhury
Shanchuan Lin
Jianyi Wang
Hao Chen
Qi Zhao
Feng Cheng
Lu Jiang
Kris Kitani
László A. Jeni
SupR
189
0
0
09 Oct 2025
Lossless Vocabulary Reduction for Auto-Regressive Language Models
Daiki Chijiwa
Taku Hasegawa
Kyosuke Nishida
Shin'ya Yamaguchi
Tomoya Ohba
Tamao Sakao
Susumu Takeuchi
80
0
0
09 Oct 2025
Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications
IEEE Access, 2025
Kento Kawaharazuka
Jihoon Oh
Jun Yamada
Ingmar Posner
Yuke Zhu
LM&Ro
227
23
0
08 Oct 2025
Towards Data-Efficient Medical Imaging: A Generative and Semi-Supervised Framework
Mosong Ma
Tania Stathaki
Michalis Lazarou
MedIm
GAN
225
0
0
07 Oct 2025
Latent Speech-Text Transformer
Yen-Ju Lu
Yashesh Gaur
Wei Zhou
Benjamin Muller
Jesus Villalba
...
Luke Zettlemoyer
Gargi Ghosh
Mike Lewis
Srinivasan Iyer
Duc Le
VLM
108
0
0
07 Oct 2025
Large Language Models Hallucination: A Comprehensive Survey
Aisha Alansari
Hamzah Luqman
HILM
LRM
441
1
0
05 Oct 2025
Multi Language Models for On-the-Fly Syntax Highlighting
Marco Edoardo Palma
Pooja Rani
Harald C. Gall
72
0
0
05 Oct 2025
Evaluating Embedding Frameworks for Scientific Domain
Nouman Ahmed
R. Wu
Victor Botev
128
0
0
03 Oct 2025
Litespark Technical Report: High-Throughput, Energy-Efficient LLM Training Framework
Nii Osae Osae Dade
Moinul Hossain Rahat
116
0
0
02 Oct 2025