SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

19 August 2018
Taku Kudo
John Richardson
arXiv 1808.06226 · GitHub (10925★)

Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

50 / 2,063 papers shown
Nyonic Technical Report
Junfeng Tian
Rui Wang
Cong Li
Yudong Zhou
Jun Liu
Jun Wang
152
1
0
24 Apr 2024
Multi-Head Mixture-of-Experts
Xun Wu
Shaohan Huang
Wenhui Wang
Furu Wei
MoE
243
27
0
23 Apr 2024
SpaceByte: Towards Deleting Tokenization from Large Language Modeling
Kevin Slagle
216
14
0
22 Apr 2024
Less Peaky and More Accurate CTC Forced Alignment by Label Priors
Ruizhe Huang
Xiaohui Zhang
Zhaoheng Ni
Li Sun
Moto Hira
...
Vineel Pratap
Shinji Watanabe
Daniel Povey
Sanjeev Khudanpur
353
12
0
22 Apr 2024
TartuNLP @ SIGTYP 2024 Shared Task: Adapting XLM-RoBERTa for Ancient and Historical Languages
Aleksei Dorkin
Kairit Sirts
127
3
0
19 Apr 2024
Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair
Yusuke Sakai
Mana Makinae
Hidetaka Kamigaito
Taro Watanabe
226
5
0
18 Apr 2024
Neuron Specialization: Leveraging intrinsic task modularity for multilingual machine translation
Shaomu Tan
Di Wu
Christof Monz
MoMe
303
21
0
17 Apr 2024
Language Model Cascades: Token-level uncertainty and beyond
Neha Gupta
Harikrishna Narasimhan
Wittawat Jitkrittum
A. S. Rawat
A. Menon
Sanjiv Kumar
UQLM
443
90
0
15 Apr 2024
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning
Quang Minh Dinh
Minh Khoi Ho
Anh Quan Dang
Hung Phong Tran
248
19
0
14 Apr 2024
TransformerFAM: Feedback attention is working memory
Dongseong Hwang
Weiran Wang
Zhuoyuan Huo
K. Sim
P. M. Mengibar
418
17
0
14 Apr 2024
The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments
Anton Schäfer
Haiqin Yang
Thomas Hofmann
Tiago Pimentel
Imanol Schlag
414
4
0
11 Apr 2024
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
Aleksandar Botev
Soham De
Samuel L. Smith
Anushan Fernando
George-Christian Muraru
...
Koray Kavukcuoglu
Demis Hassabis
R. Hadsell
Yee Whye Teh
Nando de Freitas
VLM
RALM
164
42
0
11 Apr 2024
Interactive Prompt Debugging with Sequence Salience
Ian Tenney
Ryan Mullins
Bin Du
Shree Pandya
Minsuk Kahng
Lucas Dixon
LRM
177
5
0
11 Apr 2024
High-Dimension Human Value Representation in Large Language Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Samuel Cahyawijaya
Delong Chen
Yejin Bang
Leila Khalatbari
Bryan Wilie
Ziwei Ji
Etsuko Ishii
Pascale Fung
619
11
0
11 Apr 2024
Analyzing the Performance of Large Language Models on Code Summarization
International Conference on Language Resources and Evaluation (LREC), 2024
Rajarshi Haldar
Anjali Narayan-Chen
197
34
0
10 Apr 2024
On the Effect of (Near) Duplicate Subwords in Language Modelling
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Anton Schäfer
Thomas Hofmann
Imanol Schlag
Tiago Pimentel
250
4
0
09 Apr 2024
Towards Robust Domain Generation Algorithm Classification
ACM Asia Conference on Computer and Communications Security (AsiaCCS), 2024
Arthur Drichel
Marc Meyer
Ulrike Meyer
AAML
197
4
0
09 Apr 2024
Interplay of Machine Translation, Diacritics, and Diacritization
Wei-Rui Chen
Ife Adebara
Muhammad Abdul-Mageed
270
2
0
09 Apr 2024
Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model
Xinrun Du
Zhouliang Yu
Songyang Gao
Ding Pan
Yuyang Cheng
...
Tianyu Zheng
Xinchen Luo
Guorui Zhou
Lei Ma
Ge Zhang
311
28
0
05 Apr 2024
Training LLMs over Neurally Compressed Text
Brian Lester
Jaehoon Lee
A. Alemi
Jeffrey Pennington
Adam Roberts
Jascha Narain Sohl-Dickstein
Noah Constant
206
11
0
04 Apr 2024
SemGrasp: Semantic Grasp Generation via Language Aligned Discretization
European Conference on Computer Vision (ECCV), 2024
Kailin Li
Jingbo Wang
Lixin Yang
Cewu Lu
Bo Dai
256
32
0
04 Apr 2024
Dynamic Neural Control Flow Execution: An Agent-Based Deep Equilibrium Approach for Binary Vulnerability Detection
International Conference on Information and Knowledge Management (CIKM), 2024
Litao Li
Steven H. H. Ding
Andrew Walenstein
P. Charland
Benjamin C. M. Fung
160
1
0
03 Apr 2024
PejorativITy: Disambiguating Pejorative Epithets to Improve Misogyny Detection in Italian Tweets
International Conference on Language Resources and Evaluation (LREC), 2024
Arianna Muti
Federico Ruggeri
Cagri Toraman
Lorenzo Musetti
Samuel Algherini
Silvia Ronchi
G. Saretto
Caterina Zapparoli
Alberto Barrón-Cedeño
101
7
0
03 Apr 2024
PhonologyBench: Evaluating Phonological Skills of Large Language Models
Ashima Suvarna
Harshita Khandelwal
Nanyun Peng
LM&MA
296
6
0
03 Apr 2024
Revisiting subword tokenization: A case study on affixal negation in large language models
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Thinh Hung Truong
Yulia Otmakhova
Karin Verspoor
Trevor Cohn
Timothy Baldwin
207
4
0
03 Apr 2024
Low-resource neural machine translation with morphological modeling
Antoine Nzeyimana
261
12
0
03 Apr 2024
BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
A. Haliassos
Andreas Zinonos
Rodrigo Mira
Stavros Petridis
Maja Pantic
VLM
SSL
AI4TS
251
21
0
02 Apr 2024
MotionChain: Conversational Motion Controllers via Multimodal Prompts
European Conference on Computer Vision (ECCV), 2024
Biao Jiang
Xin Chen
C. Zhang
Fukun Yin
Zhuoyuan Li
Gang Yu
Jiayuan Fan
VGen
LRM
276
21
0
02 Apr 2024
Release of Pre-Trained Models for the Japanese Language
International Conference on Language Resources and Evaluation (LREC), 2024
Kei Sawada
Tianyu Zhao
Makoto Shing
Kentaro Mitsui
Akio Kaga
Yukiya Hono
Toshiaki Wakatsuki
Koh Mitsuda
206
29
0
02 Apr 2024
Scaling Properties of Speech Language Models
Santiago Cuervo
R. Marxer
280
21
0
31 Mar 2024
A Systematic Analysis of Subwords and Cross-Lingual Transfer in Multilingual Translation
Francois Meyer
Jan Buys
306
1
0
29 Mar 2024
IDGenRec: LLM-RecSys Alignment with Textual ID Learning
Juntao Tan
Shuyuan Xu
Qingfeng Lan
Yingqiang Ge
Zelong Li
179
76
0
27 Mar 2024
CYCLE: Learning to Self-Refine the Code Generation
Yangruibo Ding
Marcus J. Min
Gail E. Kaiser
Baishakhi Ray
243
62
0
27 Mar 2024
Homogeneous Tokenizer Matters: Homogeneous Visual Tokenizer for Remote Sensing Image Understanding
Run Shao
Zhaoyang Zhang
Chao Tao
Yunsheng Zhang
Chengli Peng
Haifeng Li
VLM
306
14
0
27 Mar 2024
Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction
Inhwan Bae
Junoh Lee
Hae-Gon Jeon
359
53
0
27 Mar 2024
mALBERT: Is a Compact Multilingual BERT Model Still Worth It?
Christophe Servan
Sahar Ghannay
Sophie Rosset
163
1
0
27 Mar 2024
Provably Secure Disambiguating Neural Linguistic Steganography
Yuang Qi
Kejiang Chen
Kai Zeng
Weiming Zhang
Neng H. Yu
144
9
0
26 Mar 2024
Making Sentence Embeddings Robust to User-Generated Content
Lydia Nishimwe
Benoît Sagot
Rachel Bawden
3DV
197
1
0
25 Mar 2024
Understanding Emergent Abilities of Language Models from the Loss Perspective
Neural Information Processing Systems (NeurIPS), 2024
Zhengxiao Du
Aohan Zeng
Yuxiao Dong
Jie Tang
UQCV
LRM
398
77
0
23 Mar 2024
AI for Biomedicine in the Era of Large Language Models
Zhenyu Bi
Sajib Acharjee Dip
Daniel Hajialigol
Sindhura Kommu
Hanwen Liu
Meng Lu
Xuan Wang
LM&MA
AI4CE
195
10
0
23 Mar 2024
Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices
Pengxiang Zhao
Ping Li
Yingjie Gu
Yi Zheng
Stephan Ludger Kölker
Zhefeng Wang
Xiaoming Yuan
182
7
0
22 Mar 2024
M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset
Zhe Chen
Heyang Liu
Wenyi Yu
Guangzhi Sun
Hongcheng Liu
Ji Wu
Chao Zhang
Yu Wang
Yanfeng Wang
VGen
175
3
0
21 Mar 2024
Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement
Catherine Arnett
Pamela D. Rivière
Tyler A. Chang
Sean Trott
239
5
0
20 Mar 2024
Advanced Long-Content Speech Recognition With Factorized Neural Transducer
Xun Gong
Yu Wu
Jinyu Li
Shujie Liu
Rui Zhao
Xie Chen
Yanmin Qian
228
14
0
20 Mar 2024
Self-generated Replay Memories for Continual Neural Machine Translation
Michele Resta
Davide Bacciu
CLL
239
6
0
19 Mar 2024
Comparing Explanation Faithfulness between Multilingual and Monolingual Fine-tuned Language Models
Zhixue Zhao
Nikolaos Aletras
224
10
0
19 Mar 2024
Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems
International Conference on Language Resources and Evaluation (LREC), 2024
Bo-Han Lu
Yi-Hsuan Lin
En-Shiun Annie Lee
Richard Tzong-Han Tsai
163
2
0
18 Mar 2024
Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean
Changsu Choi
Yongbin Jeong
Seoyoon Park
Inho Won
HyeonSeok Lim
...
Yiseul Lee
HyeJin Lee
Younggyun Hahm
Hansaem Kim
Kyungtae Lim
282
23
0
16 Mar 2024
Exploring Chinese Humor Generation: A Study on Two-Part Allegorical Sayings
Rongwu Xu
288
4
0
16 Mar 2024
MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
Tomasz Limisiewicz
Terra Blevins
Hila Gonen
Orevaoghene Ahia
Luke Zettlemoyer
305
30
0
15 Mar 2024