SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

19 August 2018
Taku Kudo
John Richardson
arXiv: 1808.06226 · GitHub (10925★)

Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

50 / 2,063 papers shown
Regression Language Models for Code
Yash Akhauri
Xingyou Song
Arissa Wongpanich
Bryan Lewandowski
Mohamed S. Abdelfattah
184
3
0
30 Sep 2025
Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action models
Rokas Bendikas
Daniel Dijkman
Markus Peschl
Sanjay Haresh
Pietro Mazzaglia
162
1
0
28 Sep 2025
A High-Capacity and Secure Disambiguation Algorithm for Neural Linguistic Steganography
Yapei Feng
Feng Jiang
Shanhao Wu
Hua Zhong
142
0
0
26 Sep 2025
Partial Parameter Updates for Efficient Distributed Training
Anastasiia Filippova
Angelos Katharopoulos
David Grangier
Ronan Collobert
FedML
132
0
0
26 Sep 2025
Tokenization and Representation Biases in Multilingual Models on Dialectal NLP Tasks
Vani Kanjirangat
Tanja Samardžić
Ljiljana Dolamic
Fabio Rinaldi
84
1
0
24 Sep 2025
Low-Resource English-Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks
Hailay Teklehaymanot
Gebrearegawi Gidey
Wolfgang Nejdl
104
0
0
24 Sep 2025
False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models
Julie Kallini
Dan Jurafsky
Christopher Potts
Martijn Bartelds
185
0
0
23 Sep 2025
Computational Social Linguistics for Telugu Cultural Preservation: Novel Algorithms for Chandassu Metrical Pattern Recognition
Boddu Sri Pavan
Boddu Swathi Sree
52
0
0
23 Sep 2025
DTW-Align: Bridging the Modality Gap in End-to-End Speech Translation with Dynamic Time Warping Alignment
Abderrahmane Issam
Yusuf Can Semerci
Jan Scholtes
Gerasimos Spanakis
88
0
0
23 Sep 2025
Cross-Attention is Half Explanation in Speech-to-Text Models
Sara Papi
Dennis Fucci
Marco Gaido
Matteo Negri
L. Bentivogli
LRM
161
0
0
22 Sep 2025
Enhancing Cross-Lingual Transfer through Reversible Transliteration: A Huffman-Based Approach for Low-Resource Languages
Wenhao Zhuang
Yuan Sun
Xiaobing Zhao
108
0
0
22 Sep 2025
CUTE: A Multilingual Dataset for Enhancing Cross-Lingual Knowledge Transfer in Low-Resource Languages
International Conference on Computational Linguistics (COLING), 2025
Wenhao Zhuang
Yuan Sun
108
1
0
21 Sep 2025
Chunk Based Speech Pre-training with High Resolution Finite Scalar Quantization
Yun Tang
Cindy Tseng
96
0
0
19 Sep 2025
Deep learning and abstractive summarisation for radiological reports: an empirical study for adapting the PEGASUS models' family with scarce data
Claudio Benzoni
Martina Langhals
Martin Boeker
Luise Modersohn
Máté E. Maros
MedIm
107
0
0
18 Sep 2025
Comparative Analysis of Tokenization Algorithms for Low-Resource Language Dzongkha
Tandin Wangchuk
Tad Gonsalves
84
0
0
18 Sep 2025
Multi-Channel Differential ASR for Robust Wearer Speech Recognition on Smart Glasses
Yufeng Yang
Yiteng Huang
Yong Xu
Li Wan
Suwon Shon
...
Zhaojun Yang
Olivier Siohan
Yue Liu
Ming Sun
Florian Metze
120
0
0
17 Sep 2025
Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST
Monica Sekoyan
Nithin Rao Koluguri
Nune Tadevosyan
Piotr Żelasko
Travis M. Bartley
Nick Karpov
Jagadeesh Balam
Boris Ginsburg
VLM
170
5
0
17 Sep 2025
Data-independent Beamforming for End-to-end Multichannel Multi-speaker ASR
Can Cui
P. Magron
M. Sadeghi
Emmanuel Vincent
112
0
0
12 Sep 2025
Long Context Automated Essay Scoring with Language Models
Christopher Ormerod
Gitit Kehat
123
0
0
12 Sep 2025
MoVoC: Morphology-Aware Subword Construction for Geez Script Languages
Hailay Teklehaymanot
Dren Fazlija
Wolfgang Nejdl
114
0
0
10 Sep 2025
Continuous Audio Language Models
Simon Rouard
Manu Orsini
Axel Roebel
Neil Zeghidour
Alexandre Défossez
AuLLM
KELM
263
2
0
08 Sep 2025
Optimal Multi-Task Learning at Regularization Horizon for Speech Translation Task
JungHo Jung
Junhyun Lee
94
0
0
04 Sep 2025
Do LLM Modules Generalize? A Study on Motion Generation for Autonomous Driving
Mingyi Wang
Jingke Wang
Tengju Ye
Junbo Chen
Kaicheng Yu
AILaw
172
1
0
02 Sep 2025
NADI 2025: The First Multidialectal Arabic Speech Processing Shared Task
Bashar Talafha
Hawau Olamide Toyin
Peter Sullivan
AbdelRahim Elmadany
Abdurrahman Juma
...
Hamad Alshehhi
Hanan Aldarmaki
Mustafa Jarrar
Nizar Habash
Muhammad Abdul-Mageed
186
4
0
02 Sep 2025
Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models
Ruiyi Yan
Yugo Murawaki
WaLM
163
0
0
28 Aug 2025
Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach
Han Yang
Jian Lan
Yihong Liu
Hinrich Schütze
Thomas Seidl
AAML
72
0
0
28 Aug 2025
Heterogeneous Self-Supervised Acoustic Pre-Training with Local Constraints
Xiaodong Cui
A. F. M. Saif
Brian Kingsbury
Tianyi Chen
SSL
189
0
0
27 Aug 2025
Insights into User Interface Innovations from a Design Thinking Workshop at deRSE25
Maximilian Frank
Simon Lund
97
0
0
26 Aug 2025
It's All About In-Context Learning! Teaching Extremely Low-Resource Languages to LLMs
Yue Li
Zhixue Zhao
Carolina Scarton
141
2
0
26 Aug 2025
Grounding the Ungrounded: A Spectral-Graph Framework for Quantifying Hallucinations in Multimodal LLMs
Supratik Sarkar
Swagatam Das
139
1
0
26 Aug 2025
Stack Trace-Based Crash Deduplication with Transformer Adaptation
Md Afif Al Mamun
Gias Uddin
Lan Xia
Longyu Zhang
104
0
0
26 Aug 2025
Speculating LLMs' Chinese Training Data Pollution from Their Tokens
Qingjie Zhang
Di Wang
Haoting Qian
Liu Yan
Tianwei Zhang
Ke Xu
Qi Li
Minlie Huang
Hewu Li
Han Qiu
95
1
0
25 Aug 2025
JaParaPat: A Large-Scale Japanese-English Parallel Patent Application Corpus
International Conference on Language Resources and Evaluation (LREC), 2025
Masaaki Nagata
Katsuki Chousa
Norihito Yasuda
75
4
0
22 Aug 2025
VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models
Hanling Zhang
Yayu Zhou
Tongcheng Fang
Zhihang Yuan
Guohao Dai
Yu Wang
75
0
0
21 Aug 2025
Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek
Mukhammadsaid Mamasaidov
Azizullah Aral
Abror Shopulatov
Mironshoh Inomjonov
80
1
0
20 Aug 2025
CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models
Catherine Glossop
William Chen
Arjun Bhorkar
Dhruv Shah
Sergey Levine
LM&Ro
194
5
0
19 Aug 2025
Tokens with Meaning: A Hybrid Tokenization Approach for NLP
M. Ali Bayram
Ali Arda Fincan
Ahmet Semih Gümüş
Sercan Karakaş
Banu Diri
Savaş Yıldırım
Demircan Çelik
85
0
0
19 Aug 2025
Doğal Dil İşlemede Tokenizasyon Standartları ve Ölçümü: Türkçe Üzerinden Büyük Dil Modellerinin Karşılaştırmalı Analizi [Tokenization Standards and Measurement in Natural Language Processing: A Comparative Analysis of Large Language Models on Turkish]
Signal Processing and Communications Applications Conference (SIU), 2025
M. Ali Bayram
Ali Arda Fincan
Ahmet Semih Gümüş
Sercan Karakaş
Banu Diri
Savaş Yıldırım
64
0
0
18 Aug 2025
SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance
Andrei-Valentin Tanase
Elena Pelican
125
1
0
16 Aug 2025
Large Language Models for Summarizing Czech Historical Documents and Beyond
International Conference on Agents and Artificial Intelligence (ICAART), 2025
Václav Tran
Jakub Šmíd
J. Martínek
Ladislav Lenc
Pavel Král
130
1
0
14 Aug 2025
Objective Soups: Multilingual Multi-Task Modeling for Speech Processing
A. F. M. Saif
Lisha Chen
Xiaodong Cui
Songtao Lu
Brian Kingsbury
Tianyi Chen
93
0
0
12 Aug 2025
Special-Character Adversarial Attacks on Open-Source Language Model
Ephraiem Sarabamoun
117
1
0
12 Aug 2025
DeCAL Tokenwise Compression
Sameer Panwar
149
0
0
11 Aug 2025
Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
Saketh Reddy Vemula
Sandipan Dandapat
D. Sharma
Parameswari Krishnamurthy
235
0
0
11 Aug 2025
Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models
Tomohiro Sawada
Kartik Goyal
MoMe
97
0
0
08 Aug 2025
H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages
Mehrdad Zakershahrak
Samira Ghodratnama
VLM
72
0
0
07 Aug 2025
The Art of Breaking Words: Rethinking Multilingual Tokenizer Design
Aamod Thakur
Ajay Nagpal
Atharva Savarkar
Kundeshwar Pundalik
Siddhesh Dosi
Piyush Sawarkar
Viraj Thakur
Rohit Saluja
Maunendra Sankar Desarkar
Ganesh Ramakrishnan
104
1
0
03 Aug 2025
Pre-trained Models Perform the Best When Token Distributions Follow Zipf's Law
Yanjin He
Qingkai Zeng
Meng Jiang
172
1
0
30 Jul 2025
Multi-Hypothesis Distillation of Multilingual Neural Translation Models for Low-Resource Languages
Aarón Galiano-Jiménez
Juan Antonio Pérez-Ortiz
F. Sánchez-Martínez
Víctor M. Sánchez-Cartagena
201
0
0
29 Jul 2025
Enhancing Hindi NER in Low Context: A Comparative study of Transformer-based models with vs. without Retrieval Augmentation
Sumit Singh
Rohit Mishra
U. Tiwary
103
0
0
21 Jul 2025