SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

19 August 2018

Taku Kudo

John Richardson

Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

50 / 2,063 papers shown

Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing

Pooya Fayyazsanavi

Antonios Anastasopoulos

Jana Kosecka

SLR

213

01 Jul 2024

Calibrated Large Language Models for Binary Question Answering

Patrizio Giovannotti

Alexander Gammerman

210

01 Jul 2024

xSemAD: Explainable Semantic Anomaly Detection in Event Logs Using Sequence-to-Sequence Models

Kiran Busch

T. Kampik

Henrik Leopold

103

28 Jun 2024

Token-Weighted RNN-T for Learning from Flawed Data

Gil Keren

Wei Zhou

Ozlem Kalinli

263

26 Jun 2024

PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry

Linqing Chen

...

Lisha Zhang

346

26 Jun 2024

Efficient Document Ranking with Learnable Late Interactions

Ziwei Ji

242

25 Jun 2024

CharED: Character-wise Ensemble Decoding for Large Language Models

249

25 Jun 2024

Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

Paarth Neekhara

Shehzeen Samarah Hussain

Subhankar Ghosh

200

25 Jun 2024

Data curation via joint example selection further accelerates multimodal learning

301

25 Jun 2024

Vaporetto: Efficient Japanese Tokenization Based on Improved Pointwise Linear Classification

24 Jun 2024

Understanding and Mitigating Tokenization Bias in Language Models

257

24 Jun 2024

Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers

342

24 Jun 2024

Large Vocabulary Size Improves Large Language Models

311

24 Jun 2024

Revisiting Interpolation Augmentation for Speech-to-Text Generation

Chen Xu

Jingbo Zhu

193

22 Jun 2024

TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers

Yakun Song

Zhuo Chen

Xiaofei Wang

Ziyang Ma

Guanrou Yang

Xie Chen

AuLLM

128

22 Jun 2024

Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions

Bhuvana Ramabhadran

181

20 Jun 2024

Exploring Design Choices for Building Language-Specific LLMs

Atula Tejaswi

Nilesh Gupta

Eunsol Choi

248

20 Jun 2024

How to Compute the Probability of a Word

Tiago Pimentel

Clara Meister

244

20 Jun 2024

Infusing clinical knowledge into tokenisers for language models

Beatrice Alex

186

20 Jun 2024

On the Evaluation Practices in Multilingual NLP: Can Machine Translation Offer an Alternative to Human Translations?

308

20 Jun 2024

Lexically Grounded Subword Segmentation

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Jindřich Libovický

Jindřich Helcl

245

19 Jun 2024

How effective is Multi-source pivoting for Translation of Low Resource Indian Languages?

Pranav Gaikwad

Meet Doshi

Mary Dabre

Pushpak Bhattacharyya

200

19 Jun 2024

Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

Jinhyuk Lee

Anthony Chen

Zhuyun Dai

Dheeru Dua

Devendra Singh Sachan

...

Sebastian Riedel

228

19 Jun 2024

Nemotron-4 340B Technical Report

Nvidia

Bo Adler

Niket Agarwal

Ashwath Aithal

...

Jimmy Zhang

Jing Zhang

Vivienne Zhang

Yian Zhang

Chen Zhu

301

111

17 Jun 2024

Tokenization Falling Short: The Curse of Tokenization

213

17 Jun 2024

Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language Models

Sheng Feng

Heyang Liu

Yu Wang

Yanfeng Wang

106

17 Jun 2024

Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Zebang Cheng

Zhi-Qi Cheng

Jun-Yan He

Yuxuan Zhou

Kai Wang

Yuxiang Lin

Zheng Lian

Xiaojiang Peng

Alexander G. Hauptmann

MLLM

248

115

17 Jun 2024

Unveiling the Power of Source: Source-based Minimum Bayes Risk Decoding for Neural Machine Translation

558

17 Jun 2024

CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving

Bhavani Shankar

Preethi Jyothi

Pushpak Bhattacharyya

313

16 Jun 2024

Multilingual Large Language Models and Curse of Multilinguality

Daniil Gurgurov

Tanja Bäumel

Tatiana Anikina

304

15 Jun 2024

CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge

Interspeech, 2024

Zehua Liu

187

14 Jun 2024

UniBridge: A Unified Approach to Cross-Lingual Transfer Learning for Low-Resource Languages

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Trinh Pham

Khoi M. Le

Luu Anh Tuan

363

14 Jun 2024

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Oğuzhan Fatih Kar

Mingfei Gao

268

13 Jun 2024

Transformer-based Model for ASR N-Best Rescoring and Rewriting

Iwen E. Kang

Christophe Van Gysel

Man-Hung Siu

227

12 Jun 2024

An Empirical Study of Mamba-based Language Models

...

Jan Kautz

326

142

12 Jun 2024

PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding

191

12 Jun 2024

Languages Transferred Within the Encoder: On Representation Transfer in Zero-Shot Multilingual Translation

Zhi Qu

Chenchen Ding

Taro Watanabe

299

12 Jun 2024

A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation

Zhengrui Ma

Qingkai Fang

Shaolei Zhang

Shoutao Guo

Yang Feng

Min Zhang

218

11 Jun 2024

EAVE: Efficient Product Attribute Value Extraction via Lightweight Sparse-layer Interaction

Lifu Huang

189

10 Jun 2024

StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

382

10 Jun 2024

Attention as a Hypernetwork

International Conference on Learning Representations (ICLR), 2024

269

09 Jun 2024

Exploring the Benefits of Tokenization of Discrete Acoustic Units

Interspeech, 2024

Avihu Dekel

Raul Fernandez

158

08 Jun 2024

Large Language Model-guided Document Selection

Xiang Kong

Tom Gunter

Ruoming Pang

192

07 Jun 2024

Recovering document annotations for sentence-level bitext

R. Wicks

Matt Post

Philipp Koehn

275

06 Jun 2024

Enhancing CTC-based speech recognition with diverse modeling units

Zhen Huang

339

05 Jun 2024

StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning

Shaolei Zhang

Yang Feng

258

05 Jun 2024

LCS: A Language Converter Strategy for Zero-Shot Neural Machine Translation

Zengkui Sun

Yijin Liu

Fandong Meng

Jinan Xu

Jie Zhou

323

05 Jun 2024

Xmodel-LM Technical Report

Yichuan Wang

Qun Wang

266

05 Jun 2024

Multi-word Term Embeddings Improve Lexical Product Retrieval

Viktor Shcherbakov

Fedor Krasnov

172

03 Jun 2024

Applying Intrinsic Debiasing on Downstream Tasks: Challenges and Considerations for Machine Translation

Bar Iluz

Yanai Elazar

Asaf Yehudai

Gabriel Stanovsky

198

02 Jun 2024