arXiv: 1808.06226
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

19 August 2018
Taku Kudo
John Richardson
GitHub (10925★)

Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

50 / 2,063 papers shown
The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion
Computer Vision and Pattern Recognition (CVPR), 2024
Changan Chen
Juze Zhang
S. K. Lakshmikanth
Yusu Fang
Ruizhi Shao
Gordon Wetzstein
L. Fei-Fei
Ehsan Adeli
VGen
356
16
0
13 Dec 2024
Efficient Continual Pre-training of LLMs for Low-resource Languages
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Arijit Nag
Soumen Chakrabarti
Animesh Mukherjee
Niloy Ganguly
287
2
0
13 Dec 2024
Multi-Head Encoding for Extreme Label Classification
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Daojun Liang
Haixia Zhang
Dongfeng Yuan
Minggao Zhang
259
0
0
13 Dec 2024
PolyIPA -- Multilingual Phoneme-to-Grapheme Conversion Model
Davor Lauc
220
1
0
12 Dec 2024
Scaling Sequential Recommendation Models with Transformers
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024
Pablo Zivic
Hernán Ceferino Vázquez
Jorge Sanchez
OffRL, LRM
291
19
0
10 Dec 2024
Representation Purification for End-to-End Speech Translation
International Conference on Computational Linguistics (COLING), 2024
Chengwei Zhang
Yue Zhou
Rui Zhao
Yidong Chen
Xiaodong Shi
182
4
0
05 Dec 2024
From Language Models over Tokens to Language Models over Characters
Tim Vieira
Ben LeBrun
Mario Giulianelli
Juan Luis Gastaldi
Brian DuSell
John Terilla
Timothy J. O'Donnell
Robert Bamler
447
21
0
04 Dec 2024
Improving Language Transfer Capability of Decoder-only Architecture in Multilingual Neural Machine Translation
Zhi Qu
Yiran Wang
Chenchen Ding
Hideki Tanaka
Masao Utiyama
Taro Watanabe
LRM
136
0
0
03 Dec 2024
Yi-Lightning Technical Report
01.AI:
Alan Wake
Albert Wang
Bei Chen
...
Yuxuan Sha
Zhaodong Yan
Zhiyuan Liu
Zirui Zhang
Zonghong Dai
OSLM
708
10
0
02 Dec 2024
A Wave is Worth 100 Words: Investigating Cross-Domain Transferability in Time Series
Xiangkai Ma
Xiaobin Hong
Wenzhong Li
Sanglu Lu
AI4TS
272
0
0
01 Dec 2024
ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain
Ali Shiraee Kasmaee
Mohammad Khodadad
Mohammad Arshi Saloot
Nick Sherck
Stephen Dokas
H. Mahyar
Soheila Samiee
ELM
1.3K
8
0
30 Nov 2024
Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods
IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2024
Burak Suyunu
Enes Taylan
Arzucan Özgür
234
4
0
26 Nov 2024
Visual-Word Tokenizer: Beyond Fixed Sets of Tokens in Vision Transformers
Leonidas Gee
Wing Yan Li
V. Sharmanska
Novi Quadrianto
ViT
681
0
0
23 Nov 2024
Context-Aware Multimodal Pretraining
Computer Vision and Pattern Recognition (CVPR), 2024
Karsten Roth
Zeynep Akata
Dima Damen
Ivana Balazevic
Olivier J. Hénaff
VLM
353
4
0
22 Nov 2024
Why do language models perform worse for morphologically complex languages?
Catherine Arnett
Benjamin Bergen
230
32
0
21 Nov 2024
The Master-Slave Encoder Model for Improving Patent Text Summarization: A New Approach to Combining Specifications and Claims
Shu Zhou
Xin Wang
Zhengda Zhou
Haohan Yi
Xuhui Zheng
Hao Wan
272
2
0
21 Nov 2024
Watermark under Fire: A Robustness Evaluation of LLM Watermarking
Jiacheng Liang
Zian Wang
Lauren Hong
R. Beyah
Ting Wang
AAML
545
0
0
20 Nov 2024
Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation
Tim Elsner
Paula Usinger
Julius Nehring-Wirxel
Gregor Kobsik
Victor Czech
Yanjiang He
I. Lim
Leif Kobbelt
255
1
0
15 Nov 2024
Xmodel-1.5: An 1B-scale Multilingual LLM
Wang Qun
Liu Yang
Lin Qingquan
Jiang Ling
LRM
351
0
0
15 Nov 2024
Fine-tuning Large Language Models with Limited Data: A Survey and Practical Guide
Márton Szép
Daniel Rueckert
Rüdiger von Eisenhart-Rothe
Florian Hinterwimmer
SyDa, ALM
576
6
0
14 Nov 2024
Mamba-based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition
Spoken Language Technology Workshop (SLT), 2024
Yoshiki Masuyama
Koichi Miyazaki
Masato Murata
Mamba
264
6
0
11 Nov 2024
When are 1.58 bits enough? A Bottom-up Exploration of BitNet Quantization
Jacob Nielsen
Lukas Galke
Peter Schneider-Kamp
MQ
228
1
0
08 Nov 2024
Fine-Grained Reward Optimization for Machine Translation using Error Severity Mappings
Miguel Moura Ramos
Tomás Almeida
Daniel Vareta
Filipe Azevedo
Sweta Agrawal
Patrick Fernandes
Marcely Zanon Boito
471
7
0
08 Nov 2024
Deploying Multi-task Online Server with Large Language Model
International Conference on Computational Linguistics (COLING), 2024
Yincen Qu
Chao Ma
Xiangying Dai
Hui Zhou
Yiting Wu
Hengyue Liu
244
0
0
06 Nov 2024
Classification Done Right for Vision-Language Pre-Training
Neural Information Processing Systems (NeurIPS), 2024
Zilong Huang
Qinghao Ye
Bingyi Kang
Jiashi Feng
Haoqi Fan
CLIP, VLM
419
7
0
05 Nov 2024
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
Neural Information Processing Systems (NeurIPS), 2024
A. Haliassos
Rodrigo Mira
Honglie Chen
Zoe Landgraf
Stavros Petridis
Maja Pantic
SSL
339
14
0
04 Nov 2024
SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation
Dennis Fucci
Marco Gaido
Beatrice Savoldi
Matteo Negri
Mauro Cettolo
L. Bentivogli
585
5
0
03 Nov 2024
MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Langlin Huang
Mengyu Bu
Yang Feng
255
0
0
03 Nov 2024
Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval
IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2024
Nikolaos Flemotomos
Roger Hsiao
P. Swietojanski
Takaaki Hori
Dogan Can
Xiaodan Zhuang
478
3
0
01 Nov 2024
LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models
Nam V. Nguyen
Thong T. Doan
Luong Tran
Van Nguyen
Quang Pham
MoE
604
4
0
01 Nov 2024
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
International Conference on Learning Representations (ICLR), 2024
Julie Kallini
Shikhar Murty
Christopher D. Manning
Christopher Potts
Róbert Csordás
416
14
0
28 Oct 2024
From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages
Artur Kiulian
Anton Polishko
M. Khandoga
Yevhen Kostiuk
Guillermo Gabrielli
...
Hrishikesh Garud
Wendy Wing Yee Mak
Dmytro Chaplynskyi
Selma Belhadj Amor
Grigol Peradze
212
4
0
24 Oct 2024
A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
A. S. Rawat
Veeranjaneyulu Sadhanala
Afshin Rostamizadeh
Ayan Chakrabarti
Wittawat Jitkrittum
...
Rakesh Shivanna
Sashank J. Reddi
A. Menon
Rohan Anil
Sanjiv Kumar
465
10
0
24 Oct 2024
Bielik 7B v0.1: A Polish Language Model -- Development, Insights, and Evaluation
Krzysztof Ociepa
Łukasz Flis
Krzysztof Wróbel
Adrian Gwoździej
Remigiusz Kinas
186
6
0
24 Oct 2024
Scalable Influence and Fact Tracing for Large Language Model Pretraining
International Conference on Learning Representations (ICLR), 2024
Tyler A. Chang
Dheeraj Rajagopal
Tolga Bolukbasi
Lucas Dixon
Ian Tenney
TDI
307
16
0
22 Oct 2024
PLDR-LLM: Large Language Model from Power Law Decoder Representations
Burc Gokden
140
2
0
22 Oct 2024
Methods of improving LLM training stability
Oleg Rybakov
Mike Chrzanowski
Peter Dykas
Jinze Xue
Ben Lanir
211
8
0
22 Oct 2024
Action abstractions for amortized sampling
International Conference on Learning Representations (ICLR), 2024
Oussama Boussif
Léna Néhale Ezzine
J. Viviano
Michał Koziarski
Moksh Jain
Nikolay Malkin
Emmanuel Bengio
Rim Assouel
Yoshua Bengio
194
2
0
19 Oct 2024
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens
International Conference on Learning Representations (ICLR), 2024
Lijie Fan
Tianhong Li
Siyang Qin
Yuanzhen Li
Chen Sun
Michael Rubinstein
Deqing Sun
Kaiming He
Yonglong Tian
VLM, DiffM
325
110
0
17 Oct 2024
MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations
Liang Xu
Shaoyang Hua
Zili Lin
Yifan Liu
Feipeng Ma
Yichao Yan
Xin Jin
Xiaokang Yang
Wenjun Zeng
VGen
252
13
0
17 Oct 2024
Nominal Class Assignment in Swahili: A Computational Account
Giada Palmieri
Konstantinos Kogkalidis
94
0
0
16 Oct 2024
Interpreting token compositionality in LLMs: A robustness analysis
Nura Aljaafari
Danilo S. Carvalho
André Freitas
433
3
0
16 Oct 2024
Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5
Thao Anh Dang
Limor Raviv
Lukas Galke
310
9
0
15 Oct 2024
LargePiG: Your Large Language Model is Secretly a Pointer Generator
Zhongxiang Sun
Zihua Si
Xiaoxue Zang
Kai Zheng
Yang Song
Xiao Zhang
Jun Xu
HILM, RALM
227
0
0
15 Oct 2024
Transfer Learning with Foundational Models for Time Series Forecasting using Low-Rank Adaptations
Information Fusion (Inf. Fusion), 2024
M. Germán-Morales
A. J. Rivera-Rivas
M. J. del Jesus Díaz
C. J. Carmona
AI4TS, AI4CE
730
7
0
15 Oct 2024
ChakmaNMT: Machine Translation for a Low-Resource and Endangered Language via Transliteration
Aunabil Chakma
Aditya Chakma
Soham Khisa
Chumui Tripura
Masum Hasan
Rifat Shahriyar
104
3
0
14 Oct 2024
Language Model Embeddings Can Be Sufficient for Bayesian Optimization
Tung Nguyen
Qiuyi Zhang
Bangding Yang
Chansoo Lee
J. Bornschein
Yingjie Miao
Sagi Perel
Yutian Chen
Xingyou Song
BDL
357
11
0
14 Oct 2024
Text Classification using Graph Convolutional Networks: A Comprehensive Survey
ACM Computing Surveys (ACM CSUR), 2024
Syed Mustafa Haider Rizvi
Ramsha Imran
Arif Mahmood
GNN, OOD, FaML
197
9
0
12 Oct 2024
Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?
International Conference on Learning Representations (ICLR), 2024
HyoJung Han
Akiko Eriguchi
Haoran Xu
Hieu T. Hoang
Marine Carpuat
Huda Khayrallah
VLM
238
8
0
12 Oct 2024
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling
Neural Information Processing Systems (NeurIPS), 2024
Linhui Xiao
Xiaoshan Yang
Fang Peng
Yaowei Wang
Changsheng Xu
ObjD
434
22
0
10 Oct 2024