SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

19 August 2018
Taku Kudo
John Richardson
ArXiv (abs) | PDF | HTML | Github (10925★)

Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

50 / 2,061 papers shown
Large language models as uncertainty-calibrated optimizers for experimental discovery
Bojana Ranković
Ryan-Rhys Griffiths
P. Schwaller
BDL
1.1K
3
0
08 Apr 2025
JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration
Computer Vision and Pattern Recognition (CVPR), 2025
Yunlong Lin
Zixu Lin
Zhaodong Sun
Panwang Pan
Chaofan Li
Sixiang Chen
Yeying Jin
Wenbo Li
Xinghao Ding
346
13
0
05 Apr 2025
Enhancing Embedding Representation Stability in Recommendation Systems with Semantic ID
ACM Conference on Recommender Systems (RecSys), 2025
Carolina Zheng
Minhui Huang
Dmitrii Pedchenko
Kaushik Rangadurai
Shuaiqiang Wang
...
Yiping Han
Lin Yang
Hangjun Xu
Rong Jin
Shuang Yang
250
16
0
02 Apr 2025
Efficient Federated Learning Tiny Language Models for Mobile Network Feature Prediction
Daniel Becking
Ingo Friese
Karsten Müller
Thomas Buchholz
Mandy Galkow-Schneider
Wojciech Samek
D. Marpe
109
0
0
02 Apr 2025
SocialGen: Modeling Multi-Human Social Interaction with Language Models
Heng Yu
Juze Zhang
Changan Chen
Tiange Xiang
Yusu Fang
Juan Carlos Niebles
Ehsan Adeli
VGen
240
5
0
28 Mar 2025
Tokenization of Gaze Data
Tim Rolff
Jurik Karimian
Niklas Hypki
S. Schmidt
Markus Lappe
Frank Steinicke
261
0
0
28 Mar 2025
Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models
Alex Jinpeng Wang
Linjie Li
Zhiyong Yang
Lijuan Wang
Min Li
DiffM
242
2
0
26 Mar 2025
Named Entity Recognition in Context
Colin Brisson
Ayoub Kahfy
Marc Bui
Frédéric Constant
295
0
0
26 Mar 2025
Gemma 3 Technical Report
Gemma Team
Aishwarya B Kamath
Johan Ferret
Shreya Pathak
Nino Vieillard
...
Harshal Tushar Lehri
Hussein Hazimeh
Ian Ballantyne
Idan Szpektor
Ivan Nardini
VLM
505
728
0
25 Mar 2025
Payload-Aware Intrusion Detection with CMAE and Large Language Models
ACM Transactions on Privacy and Security (TOPS), 2025
Yongcheol Kim
Chanjae Lee
Young Yoon
212
3
0
23 Mar 2025
KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications
M. Bommarito
Daniel Martin Katz
Jillian Bommarito
171
3
0
21 Mar 2025
Self-Vocabularizing Training for Neural Machine Translation
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Pin-Jie Lin
Ernie Chang
Yangyang Shi
Vikas Chandra
335
0
0
18 Mar 2025
Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
Lijie Fan
Luming Tang
Siyang Qin
Tianhong Li
Xuan S. Yang
...
Tao Zhu
Michael Rubinstein
Michalis Raptis
Deqing Sun
Radu Soricut
291
25
0
17 Mar 2025
SuperBPE: Space Travel for Language Models
Alisa Liu
J. Hayase
Valentin Hofmann
Sewoong Oh
Noah A. Smith
Yejin Choi
436
23
0
17 Mar 2025
Plausibility Vaccine: Injecting LLM Knowledge for Event Plausibility
Jacob Chmura
Jonah Dauvet
Sebastian Sabry
171
0
0
16 Mar 2025
Scaling Laws for Conditional Emergence of Multilingual Image Captioning via Generalization from Translation
Julian Spravil
Sebastian Houben
Sven Behnke
VLM
535
0
0
12 Mar 2025
BPQA Dataset: Evaluating How Well Language Models Leverage Blood Pressures to Answer Biomedical Questions
Chi Hang
Ruiqi Deng
L. Jiang
Zihao Yang
Anton Alyakin
Daniel Alber
E. Oermann
AI4MH
LM&MA
168
0
0
06 Mar 2025
On the Acquisition of Shared Grammatical Representations in Bilingual Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Catherine Arnett
Tyler A. Chang
J. Michaelov
Benjamin Bergen
270
6
0
05 Mar 2025
LLM-Safety Evaluations Lack Robustness
Tim Beyer
Sophie Xhonneux
Simon Geisler
Gauthier Gidel
Leo Schwinn
Stephan Günnemann
ALM
ELM
968
10
0
04 Mar 2025
SkipPipe: Partial and Reordered Pipelining Framework for Training LLMs in Heterogeneous Networks
Nikolay Blagoev
Lydia Yiyu Chen
Oğuzhan Ersoy
216
3
0
27 Feb 2025
A City of Millions: Mapping Literary Social Networks At Scale
Sil Hamilton
Rebecca M. M. Hicke
David M. Mimno
Matthew Wilkens
GNN
961
1
0
26 Feb 2025
(Mis)Fitting: A Survey of Scaling Laws
Margaret Li
Sneha Kudugunta
Luke Zettlemoyer
392
11
0
26 Feb 2025
Scaling Laws for Downstream Task Performance in Machine Translation
International Conference on Learning Representations (ICLR), 2024
Berivan Isik
Natalia Ponomareva
Hussein Hazimeh
Dimitris Paparas
Sergei Vassilvitskii
Sanmi Koyejo
291
23
0
24 Feb 2025
Deterministic Reversible Data Augmentation for Neural Machine Translation
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Jiashu Yao
Heyan Huang
Zeming Liu
Yuhang Guo
358
0
0
21 Feb 2025
Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Lior Belenki
Alekh Agarwal
Tianze Shi
Kristina Toutanova
MoE
200
0
0
21 Feb 2025
Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling
Eric Egli
Matteo Manica
Jannis Born
133
1
0
21 Feb 2025
Lost in Space: Finding the Right Tokens for Structured Output
Sil Hamilton
David Mimno
331
0
0
20 Feb 2025
PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference
Burc Gokden
309
0
0
19 Feb 2025
From Principles to Applications: A Comprehensive Survey of Discrete Tokenizers in Generation, Comprehension, Recommendation, and Information Retrieval
Jian Jia
Jingtong Gao
Ben Xue
Junhao Wang
Qingpeng Cai
Quan Chen
Xiangyu Zhao
Peng Jiang
Kun Gai
OffRL
293
6
0
18 Feb 2025
Baichuan-M1: Pushing the Medical Capability of Large Language Models
Binghai Wang
Haizhou Zhao
Huozhi Zhou
Liang Song
Mingyu Xu
...
Yan Zhang
Yifei Duan
Yuyan Zhou
Zhi-Ming Ma
Zhikai Wu
LM&MA
ELM
AI4MH
358
31
0
18 Feb 2025
TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Shahriar Kabir Nahin
R. N. Nandi
Sagor Sarker
Quazi Sarwar Muhtaseem
Md. Kowsher
Apu Chandraw Shill
Md Ibrahim
Mehadi Hasan Menon
Tareq Al Muntasir
Firoj Alam
485
2
0
16 Feb 2025
Enhancing LLM Character-Level Manipulation via Divide and Conquer
Zhen Xiong
Yujun Cai
Bryan Hooi
Nanyun Peng
Kai-Wei Chang
Zhecheng Li
358
0
0
12 Feb 2025
Scaling Embedding Layers in Language Models
Da Yu
Edith Cohen
Badih Ghazi
Yangsibo Huang
Pritish Kamath
Ravi Kumar
Daogao Liu
Chiyuan Zhang
480
6
0
03 Feb 2025
A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport
Yacouba Kaloga
Shashi Kumar
P. Motlícek
Ina Kodrasi
OT
343
0
0
03 Feb 2025
Vision-centric Token Compression in Large Language Model
Ling Xing
Alex Jinpeng Wang
Rui Yan
Xiangbo Shu
Jinhui Tang
VLM
574
3
0
02 Feb 2025
Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Nadav Timor
Jonathan Mamou
Daniel Korat
Moshe Berchansky
Oren Pereg
Gaurav Jain
Roy Schwartz
Moshe Wasserblat
616
9
0
31 Jan 2025
Overestimation in LLM Evaluation: A Controlled Large-Scale Study on Data Contamination's Impact on Machine Translation
Muhammed Yusuf Kocyigit
Eleftheria Briakou
Daniel Deutsch
Jiaming Luo
Colin Cherry
Markus Freitag
209
4
0
30 Jan 2025
BLR-MoE: Boosted Language-Routing Mixture of Experts for Domain-Robust Multilingual E2E ASR
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Guodong Ma
Wenxuan Wang
Lifeng Zhou
Yuting Yang
Yuke Li
Binbin Du
MoE
260
3
0
22 Jan 2025
aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing
Siyuan Jiang
Jia Li
He Zong
Huanyu Liu
Hao Zhu
...
Wei Ning
G. Wang
Yihong Dong
Kechi Zhang
Ge Li
ALM
262
2
0
17 Jan 2025
Delayed Fusion: Integrating Large Language Models into First-Pass Decoding in End-to-end Speech Recognition
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Takaaki Hori
Martin Kocour
Adnan Haider
Erik McDermott
Xiaodan Zhuang
AuLLM
154
5
0
17 Jan 2025
ViBidirectionMT-Eval: Machine Translation for Vietnamese-Chinese and Vietnamese-Lao language pair
Journal of Computer Science and Cybernetics (JCSC), 2025
Hong-Viet Tran
Minh-Quy Nguyen
Van-Vinh Nguyen
MoE
90
0
0
15 Jan 2025
BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
Sheng Zhang
Yanbo Xu
Naoto Usuyama
Hanwen Xu
J. Bagga
...
Carlo Bifulco
M. Lungren
Tristan Naumann
Sheng Wang
Hoifung Poon
LM&MA
MedIm
733
420
0
10 Jan 2025
Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zhi Qu
Yiran Wang
Jiannan Mao
Chenchen Ding
Hideki Tanaka
Masao Utiyama
Taro Watanabe
LRM
343
1
0
06 Jan 2025
On LLM-Enhanced Mixed-Type Data Imputation with High-Order Message Passing
Proceedings of the VLDB Endowment (PVLDB), 2025
Jinqiao Wang
Kai Wang
Yanzhe Zhang
Wenjie Zhang
Xiwei Xu
Xuemin Lin
239
9
0
04 Jan 2025
Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Tsz Kin Lam
Marco Gaido
Sara Papi
L. Bentivogli
Barry Haddow
405
3
0
04 Jan 2025
A Modular-based Strategy for Mitigating Gradient Conflicts in Simultaneous Speech Translation
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Xiaoqian Liu
Yangfan Du
Jiadong Wang
Yuan Ge
Chen Xu
Tong Xiao
Guocheng Chen
Jingbo Zhu
330
0
0
31 Dec 2024
ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition
AAAI Conference on Artificial Intelligence (AAAI), 2024
Seungdong Yoa
Seungjun Lee
Hyeseung Cho
Bumsoo Kim
Woohyung Lim
ViT
193
1
0
21 Dec 2024
ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling
William Jongwon Han
Chaojing Duan
M. Rosenberg
Emerson Liu
Ding Zhao
396
3
0
18 Dec 2024
Extending LLMs to New Languages: A Case Study of Llama and Persian Adaptation
International Conference on Computational Linguistics (COLING), 2024
Samin Mahdizadeh Sani
Pouya Sadeghi
Thuy-Trang Vu
Yadollah Yaghoobzadeh
Gholamreza Haffari
395
5
0
17 Dec 2024
The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion
Computer Vision and Pattern Recognition (CVPR), 2024
Changan Chen
Juze Zhang
S. K. Lakshmikanth
Yusu Fang
Ruizhi Shao
Gordon Wetzstein
L. Fei-Fei
Ehsan Adeli
VGen
332
16
0
13 Dec 2024