SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

19 August 2018

Taku Kudo

John Richardson

Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

50 / 2,063 papers shown

Nyonic Technical Report

152

24 Apr 2024

Multi-Head Mixture-of-Experts

243

23 Apr 2024

SpaceByte: Towards Deleting Tokenization from Large Language Modeling

Kevin Slagle

216

22 Apr 2024

Less Peaky and More Accurate CTC Forced Alignment by Label Priors

...

Shinji Watanabe

Sanjeev Khudanpur

353

22 Apr 2024

TartuNLP @ SIGTYP 2024 Shared Task: Adapting XLM-RoBERTa for Ancient and Historical Languages

Aleksei Dorkin

Kairit Sirts

127

19 Apr 2024

Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair

Yusuke Sakai

Mana Makinae

Hidetaka Kamigaito

Taro Watanabe

226

18 Apr 2024

Neuron Specialization: Leveraging intrinsic task modularity for multilingual machine translation

Shaomu Tan

Di Wu

Christof Monz

MoMe

303

17 Apr 2024

Language Model Cascades: Token-level uncertainty and beyond

Neha Gupta

Harikrishna Narasimhan

Sanjiv Kumar

443

15 Apr 2024

TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning

248

14 Apr 2024

TransformerFAM: Feedback attention is working memory

418

14 Apr 2024

The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments

414

11 Apr 2024

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

George-Christian Muraru

...

164

11 Apr 2024

Interactive Prompt Debugging with Sequence Salience

177

11 Apr 2024

High-Dimension Human Value Representation in Large Language Models

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

619

11 Apr 2024

Analyzing the Performance of Large Language Models on Code Summarization

International Conference on Language Resources and Evaluation (LREC), 2024

Rajarshi Haldar

Anjali Narayan-Chen

197

10 Apr 2024

On the Effect of (Near) Duplicate Subwords in Language Modelling

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

250

09 Apr 2024

Towards Robust Domain Generation Algorithm Classification

ACM Asia Conference on Computer and Communications Security (AsiaCCS), 2024

197

09 Apr 2024

Interplay of Machine Translation, Diacritics, and Diacritization

Wei-Rui Chen

Ife Adebara

Muhammad Abdul-Mageed

270

09 Apr 2024

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

...

Ge Zhang

311

05 Apr 2024

Training LLMs over Neurally Compressed Text

Jascha Narain Sohl-Dickstein

Noah Constant

206

04 Apr 2024

SemGrasp: Semantic Grasp Generation via Language Aligned Discretization

European Conference on Computer Vision (ECCV), 2024

256

04 Apr 2024

Dynamic Neural Control Flow Execution: An Agent-Based Deep Equilibrium Approach for Binary Vulnerability Detection

International Conference on Information and Knowledge Management (CIKM), 2024

160

03 Apr 2024

PejorativITy: Disambiguating Pejorative Epithets to Improve Misogyny Detection in Italian Tweets

International Conference on Language Resources and Evaluation (LREC), 2024

Arianna Muti

Alberto Barrón-Cedeño

101

03 Apr 2024

PhonologyBench: Evaluating Phonological Skills of Large Language Models

296

03 Apr 2024

Revisiting subword tokenization: A case study on affixal negation in large language models

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

Karin Verspoor

207

03 Apr 2024

Low-resource neural machine translation with morphological modeling

Antoine Nzeyimana

261

03 Apr 2024

BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

251

02 Apr 2024

MotionChain: Conversational Motion Controllers via Multimodal Prompts

European Conference on Computer Vision (ECCV), 2024

276

02 Apr 2024

Release of Pre-Trained Models for the Japanese Language

International Conference on Language Resources and Evaluation (LREC), 2024

206

02 Apr 2024

Scaling Properties of Speech Language Models

Santiago Cuervo

R. Marxer

280

31 Mar 2024

A Systematic Analysis of Subwords and Cross-Lingual Transfer in Multilingual Translation

Francois Meyer

Jan Buys

306

29 Mar 2024

IDGenRec: LLM-RecSys Alignment with Textual ID Learning

Juntao Tan

179

27 Mar 2024

CYCLE: Learning to Self-Refine the Code Generation

243

27 Mar 2024

Homogeneous Tokenizer Matters: Homogeneous Visual Tokenizer for Remote Sensing Image Understanding

306

27 Mar 2024

Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction

Inhwan Bae

Junoh Lee

Hae-Gon Jeon

359

27 Mar 2024

mALBERT: Is a Compact Multilingual BERT Model Still Worth It?

Christophe Servan

Sahar Ghannay

Sophie Rosset

163

27 Mar 2024

Provably Secure Disambiguating Neural Linguistic Steganography

144

26 Mar 2024

Making Sentence Embeddings Robust to User-Generated Content

197

25 Mar 2024

Understanding Emergent Abilities of Language Models from the Loss Perspective

Neural Information Processing Systems (NeurIPS), 2024

Yuxiao Dong

398

23 Mar 2024

AI for Biomedicine in the Era of Large Language Models

Sajib Acharjee Dip

195

23 Mar 2024

Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices

Stephan Ludger Kölker

Zhefeng Wang

Xiaoming Yuan

182

22 Mar 2024

M^3AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset

Zhe Chen

Heyang Liu

Wenyi Yu

Guangzhi Sun

Chao Zhang

175

21 Mar 2024

Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement

239

20 Mar 2024

Advanced Long-Content Speech Recognition With Factorized Neural Transducer

Xie Chen

228

20 Mar 2024

Self-generated Replay Memories for Continual Neural Machine Translation

Michele Resta

Davide Bacciu

CLL

239

19 Mar 2024

Comparing Explanation Faithfulness between Multilingual and Monolingual Fine-tuned Language Models

Zhixue Zhao

Nikolaos Aletras

224

19 Mar 2024

Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems

International Conference on Language Resources and Evaluation (LREC), 2024

Bo-Han Lu

Yi-Hsuan Lin

En-Shiun Annie Lee

Richard Tzong-Han Tsai

163

18 Mar 2024

Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean

...

282

16 Mar 2024

Exploring Chinese Humor Generation: A Study on Two-Part Allegorical Sayings

Rongwu Xu

288

16 Mar 2024

MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling

Luke Zettlemoyer

305

15 Mar 2024