SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

19 August 2018

Taku Kudo

John Richardson

Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

50 / 2,064 papers shown

Language Model Tokenizers Introduce Unfairness Between Languages

Neural Information Processing Systems (NeurIPS), 2023

346

173

17 May 2023

Epsilon Sampling Rocks: Investigating Sampling Strategies for Minimum Bayes Risk Decoding for Machine Translation

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Markus Freitag

Behrooz Ghorbani

Patrick Fernandes

212

17 May 2023

Sasha: Creative Goal-Oriented Reasoning in Smart Homes with Large Language Models

Proceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies (IMWUT), 2023

159

16 May 2023

AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation

Neural Information Processing Systems (NeurIPS), 2023

...

391

116

16 May 2023

Towards Speech Dialogue Translation Mediating Speakers of Different Languages

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Shuichiro Shimizu

Chenhui Chu

Sheng Li

Sadao Kurohashi (Kyoto University)

157

16 May 2023

Beqi: Revitalize the Senegalese Wolof Language with a Robust Spelling Corrector

Derguene Mbaye

Moussa Diallo

111

15 May 2023

MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

Neural Information Processing Systems (NeurIPS), 2023

Luke Zettlemoyer

296

136

12 May 2023

Advancing Neural Encoding of Portuguese with Transformer Albertina PT-*

Portuguese Conference on Artificial Intelligence (EPIA), 2023

João Rodrigues

Henrique Lopes Cardoso

T. Osório

167

11 May 2023

What is the best recipe for character-level encoder-only modelling?

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Kris Cao

149

09 May 2023

Boosting Zero-shot Cross-lingual Retrieval by Training on Artificially Code-Switched Data

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Robert Litschko

Ekaterina Artemova

Barbara Plank

203

09 May 2023

Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023

Xuandi Fu

Kanthashree Mysore Sathyendra

Athanasios Mouchtaris

296

09 May 2023

CharSpan: Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages

Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023

255

09 May 2023

Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

Automatic Speech Recognition & Understanding (ASRU), 2023

...

Krishna Puvvada

Jagadeesh Balam

Boris Ginsburg

330

144

08 May 2023

Leveraging Synthetic Targets for Machine Translation

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Sarthak Mittal

Oleksii Hrinchuk

Oleksii Kuchaiev

147

07 May 2023

Two to Five Truths in Non-Negative Matrix Factorization

International Workshop on Complex Networks & Their Applications (CNTA), 2023

215

06 May 2023

Pre-training Language Model as a Multi-perspective Course Learner

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

204

06 May 2023

Now It Sounds Like You: Learning Personalized Vocabulary On Device

AAAI Spring Symposia (ASS), 2023

320

05 May 2023

Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages

European Association for Machine Translation Conferences/Workshops (EAMT), 2023

Sonal Sannigrahi

Rachel Bawden

140

04 May 2023

Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

262

04 May 2023

What changes when you randomly choose BPE merge operations? Not much

First Workshop on Insights from Negative Results in NLP (Insights), 2023

Jonne Saleva

Constantine Lignos

146

04 May 2023

Learning Language-Specific Layers for Multilingual Machine Translation

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

256

04 May 2023

Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

229

03 May 2023

Low-Resourced Machine Translation for Senegalese Wolof Language

Derguene Mbaye

Moussa Diallo

T. Diop

163

01 May 2023

ResiDual: Transformer with Dual Residual Connections

Shufang Xie

Huishuai Zhang

Junliang Guo

Xu Tan

Jiang Bian

Hany Awadalla

Arul Menezes

Tao Qin

Rui Yan

168

28 Apr 2023

Training and Evaluation of a Multilingual Tokenizer for GPT-SW3

Felix Stollenwerk

210

28 Apr 2023

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Jiabo Ye

...

Ji Zhang

Jingren Zhou

1.1K

1,164

27 Apr 2023

Semantic Tokenizer for Enhanced Natural Language Processing

Cornelia Caragea

24 Apr 2023

NAIST-SIC-Aligned: an Aligned English-Japanese Simultaneous Interpretation Corpus

348

23 Apr 2023

Tokenization Preference for Human and Machine Learning Model: An Annotation Study

Tatsuya Hiraoka

Tomoya Iwakura

169

21 Apr 2023

Downstream Task-Oriented Neural Tokenizer Optimization with Vocabulary Restriction as Post Processing

Tatsuya Hiraoka

Tomoya Iwakura

119

21 Apr 2023

Joint Repetition Suppression and Content Moderation of Large Language Models

227

20 Apr 2023

MPMQA: Multimodal Question Answering on Product Manuals

AAAI Conference on Artificial Intelligence (AAAI), 2023

Liangfu Zhang

Anwen Hu

Jing Zhang

Shuo Hu

Qin Jin

198

19 Apr 2023

UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining

International Conference on Learning Representations (ICLR), 2023

Sharan Narang

282

101

18 Apr 2023

From Words to Music: A Study of Subword Tokenization Techniques in Symbolic Music Generation

Adarsh Kumar

Pedro Sarmento

191

18 Apr 2023

Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese

Nordic Conference of Computational Linguistics (NODALIDA), 2023

Vésteinn Snaebjarnarson

A. Simonsen

Goran Glavaš

Ivan Vulić

252

18 Apr 2023

A Survey for Biomedical Text Summarization: From Pre-trained to Large Language Models

Zheheng Luo

211

18 Apr 2023

The MiniPile Challenge for Data-Efficient Language Models

Jean Kaddour

320

17 Apr 2023

VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning

Zhen-Ru Zhang

Chuanqi Tan

Songfang Huang

Fei Huang

156

17 Apr 2023

Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca

291

391

17 Apr 2023

Neural Machine Translation For Low Resource Languages

V. Goyle

Parvathy Krishnaswamy

K. G. Ravikumar

Utsa Chattopadhyay

Kartikay Goyle

16 Apr 2023

Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation

Xiangang Li

240

16 Apr 2023

A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition

IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2023

173

15 Apr 2023

Sign Language Translation from Instructional Videos

239

13 Apr 2023

Computational modeling of semantic change

Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023

Nina Tahmasebi

Haim Dubossarsky

293

13 Apr 2023

Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis

Qingxiu Dong

Lingpeng Kong

Lei Li

369

226

10 Apr 2023

PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for Translation with Semi-Supervised Pseudo-Parallel Document Generation

Alireza Salemi

Amirhossein Abaskohi

Sara Tavakoli

Yadollah Yaghoobzadeh

A. Shakery

227

03 Apr 2023

DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains

bioRxiv (bioRxiv), 2023

315

03 Apr 2023

GreekBART: The First Pretrained Greek Sequence-to-Sequence Model

International Conference on Language Resources and Evaluation (LREC), 2023

Iakovos Evdaimon

Hadi Abdine

Christos Xypolopoulos

Stamatis Outsios

Michalis Vazirgiannis

Giorgos Stamou

112

03 Apr 2023

Spam-T5: Benchmarking Large Language Models for Few-Shot Email Spam Detection

Maxime Labonne

Sean J. Moran

300

03 Apr 2023

Exploiting Multilingualism in Low-resource Neural Machine Translation via Adversarial Learning

186

31 Mar 2023