SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

19 August 2018

Taku Kudo

John Richardson


Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"


Make Text Unlearnable: Exploiting Effective Patterns to Protect Personal Data

Xinzhe Li

Ming Liu

Shang Gao

222

02 Jul 2023

SMILE: Evaluation and Domain Adaptation for Social Media Language Understanding

Knowledge Discovery and Data Mining (KDD), 2023

158

30 Jun 2023

SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

Neural Information Processing Systems (NeurIPS), 2023

...

Alexander G. Hauptmann

Lu Jiang

MLLM

360

30 Jun 2023

X-RiSAWOZ: High-Quality End-to-End Multilingual Dialogue Datasets and Few-shot Agents

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

...

243

30 Jun 2023

A Formal Perspective on Byte-Pair Encoding

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

205

29 Jun 2023

Accelerating Transducers through Adjacent Token Merging

Interspeech (Interspeech), 2023

171

28 Jun 2023

Extending Context Window of Large Language Models via Positional Interpolation

435

678

27 Jun 2023

CamemBERT-bio: Leveraging Continual Pre-training for Cost-Effective Models on French Biomedical Data

International Conference on Language Resources and Evaluation (LREC), 2023

Rian Touchent

Laurent Romary

Eric Villemonte de la Clergerie

MedIm

212

27 Jun 2023

YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus

Neural Information Processing Systems (NeurIPS), 2023

273

27 Jun 2023

DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome

289

309

26 Jun 2023

MotionGPT: Human Motion as a Foreign Language

Neural Information Processing Systems (NeurIPS), 2023

Jingyi Yu

Tao Chen

292

450

26 Jun 2023

Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction

167

26 Jun 2023

Resume Information Extraction via Post-OCR Text Processing

Selahattin Serdar Helli

Senem Tanberk

Sena Nur Cavsak

23 Jun 2023

AudioPaLM: A Large Language Model That Can Speak and Listen

Paul Kishan Rubenstein

Chulayuth Asawaroengchai

...

257

396

22 Jun 2023

Towards Accurate Translation via Semantically Appropriate Application of Lexical Constraints

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

278

21 Jun 2023

Multi-pass Training and Cross-information Fusion for Low-resource End-to-end Accented Speech Recognition

Interspeech (Interspeech), 2023

189

20 Jun 2023

Rehearsal-Free Online Continual Learning for Automatic Speech Recognition

Interspeech (Interspeech), 2023

Steven Vander Eeckt

Hugo Van hamme

CLL

113

19 Jun 2023

Guiding Language Models of Code with Global Context using Monitors

335

19 Jun 2023

Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation

Interspeech (Interspeech), 2023

Xie Chen

154

15 Jun 2023

Unified model for code-switching speech recognition and language identification based on a concatenated tokenizer

Kunal Dhawan

Dima Rekesh

Boris Ginsburg

247

14 Jun 2023

Tagged End-to-End Simultaneous Speech Translation Training using Simultaneous Interpretation Data

International Workshop on Spoken Language Translation (IWSLT), 2023

193

14 Jun 2023

CipherSniffer: Classifying Cipher Types

Brendan Artley

G. Mehdiyev

13 Jun 2023

Tokenization with Factorized Subword Encoding

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

David Samuel

Lilja Øvrelid

191

13 Jun 2023

Modality Adaption or Regularization? A Case Study on End-to-End Speech Translation

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Yucheng Han

Chen Xu

Tong Xiao

Jingbo Zhu

205

13 Jun 2023

Measuring Sentiment Bias in Machine Translation

International Conference on Text, Speech and Dialogue (TSD), 2023

174

12 Jun 2023

Multi-View Frequency-Attention Alternative to CNN Frontends for Automatic Speech Recognition

Interspeech (Interspeech), 2023

149

12 Jun 2023

Learning Multilingual Sentence Representations with Cross-lingual Consistency Regularization

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

163

12 Jun 2023

AraMUS: Pushing the Limits of Data and Model Scale for Arabic Natural Language Processing

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

...

140

11 Jun 2023

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

Neural Information Processing Systems (NeurIPS), 2023

...

Xiaoshui Huang

Zhiyong Wang

Jing Shao

Wanli Ouyang

MLLM

277

205

11 Jun 2023

Morphosyntactic probing of multilingual BERT models

Natural Language Engineering (NLE), 2023

188

09 Jun 2023

Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition

Interspeech (Interspeech), 2023

121

09 Jun 2023

KIT's Multilingual Speech Translation System for IWSLT 2023

International Workshop on Spoken Language Translation (IWSLT), 2023

Danni Liu

179

08 Jun 2023

Privately generating tabular data using language models

Alexandre Sablayrolles

Yue Wang

Brian Karrer

LMTD

163

07 Jun 2023

Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages

Interspeech (Interspeech), 2023

Antonios Anastasopoulos

261

07 Jun 2023

Arabic Dysarthric Speech Recognition Using Adversarial and Signal-Based Augmentation

Interspeech (Interspeech), 2023

168

07 Jun 2023

LLMZip: Lossless Text Compression using Large Language Models

Chandra Shekhara Kaushik Valmeekam

370

06 Jun 2023

SciCap+: A Knowledge Augmented Dataset to Study the Challenges of Scientific Figure Captioning

328

06 Jun 2023

Enhancing Language Representation with Constructional Information for Natural Language Understanding

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

170

05 Jun 2023

End-to-End Word-Level Pronunciation Assessment with MASK Pre-training

Interspeech (Interspeech), 2023

Huiqiang Jiang

Yuqing Yang

Dongsheng Li

Linli Xu

Lili Qiu

CVBM

152

05 Jun 2023

Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model

Interspeech (Interspeech), 2023

191

05 Jun 2023

DocFormerv2: Local Features for Document Understanding

AAAI Conference on Artificial Intelligence (AAAI), 2023

248

02 Jun 2023

Data-Efficient French Language Modeling with CamemBERTa

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Wissam Antoun

Benoît Sagot

Djamé Seddah

152

02 Jun 2023

Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT

European Association for Machine Translation Conferences/Workshops (EAMT), 2023

Ljiljana Dolamic

Andrei Popescu-Belis

199

02 Jun 2023

Improved Training for End-to-End Streaming Automatic Speech Recognition Model with Punctuation

Interspeech (Interspeech), 2023

114

02 Jun 2023

Hierarchical Attention Encoder Decoder

Asier Mujika

BDL

229

01 Jun 2023

Strategies for improving low resource speech to text translation relying on pre-trained ASR models

Interspeech (Interspeech), 2023

162

31 May 2023

How to Plant Trees in Language Models: Data and Architectural Effects on the Emergence of Syntactic Inductive Biases

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Aaron Mueller

Tal Linzen

AI4CE

194

31 May 2023

Breeding Machine Translations: Evolutionary approach to survive and thrive in the world of automated evaluation

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Josef Jon

Ondrej Bojar

143

30 May 2023

Intriguing Properties of Quantization at Scale

Neural Information Processing Systems (NeurIPS), 2023

231

30 May 2023

Towards Selection of Text-to-speech Data to Augment ASR Training

Ozlem Kalinli

113

30 May 2023