SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

19 August 2018

Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

50 / 2,063 papers shown

AcrosticSleuth: Probabilistic Identification and Ranking of Acrostics in Multilingual Corpora

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

189

08 Aug 2024

EMTeC: A Corpus of Eye Movements on Machine-Generated Texts

Behavior Research Methods (BRM), 2024

Lena S. Bolliger

Patrick Haller

Isabelle Caroline Rose Cretton

D. R. Reich

Tannon Kew

Lena Ann Jäger

213

08 Aug 2024

Cooperative Multi-Agent Deep Reinforcement Learning in Content Ranking Optimization

335

08 Aug 2024

Semantics or spelling? Probing contextual word embeddings with orthographic noise

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Jacob A. Matthews

John R. Starr

Marten van Schijndel

272

08 Aug 2024

EgyBERT: A Large Language Model Pretrained on Egyptian Dialect Corpora

Faisal Qarah

240

07 Aug 2024

SETN: Stock Embedding Enhanced with Textual and Network Information

Takehiro Takayanagi

Hiroki Sakaji

Kiyoshi Izumi

AIFin

288

06 Aug 2024

Compromising Embodied Agents with Contextual Backdoor Attacks

...

250

06 Aug 2024

MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization

237

05 Aug 2024

Batching BPE Tokenization Merges

Alexander P. Morgan

154

05 Aug 2024

SNFinLLM: Systematic and Nuanced Financial Domain Adaptation of Chinese Large Language Models

232

05 Aug 2024

Advancing Post-OCR Correction: A Comparative Study of Synthetic Data

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Shuhao Guan

Derek Greene

292

05 Aug 2024

Improving Multilingual Neural Machine Translation by Utilizing Semantic and Linguistic Features

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Mengyu Bu

Shuhao Gu

Yang Feng

333

02 Aug 2024

Leveraging Entailment Judgements in Cross-Lingual Summarisation

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Huajian Zhang

Laura Perez-Beltrachini

HILM

201

01 Aug 2024

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team

Morgane Riviere

...

620

1,566

31 Jul 2024

Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning

Omer Nacar

Anis Koubaa

156

30 Jul 2024

Sentiment Analysis of Lithuanian Online Reviews Using Large Language Models

International Conference on Information Technology (ICIT), 2024

Brigita Vileikytė

M. Lukoševičius

Lukas Stankevicius

176

29 Jul 2024

Granularity is crucial when applying differential privacy to text: An investigation for neural machine translation

Doan Nam Long Vu

Timour Igamberdiev

Ivan Habernal

173

26 Jul 2024

Unified Lexical Representation for Interpretable Visual-Language Alignment

207

25 Jul 2024

Coupling Speech Encoders with Downstream Text Models

Ciprian Chelba

J. Schalkwyk

AuLLM

177

24 Jul 2024

Dependency Transformer Grammars: Integrating Dependency Structures into Transformer Language Models

Yida Zhao

Chao Lou

Kewei Tu

212

24 Jul 2024

The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization

Shinji Watanabe

223

23 Jul 2024

Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines

Sanjiv Kumar

Andrej Risteski

306

22 Jul 2024

CoVoSwitch: Machine Translation of Synthetic Code-Switched Text Based on Intonation Units

Yeeun Kang

201

19 Jul 2024

Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training

175

18 Jul 2024

A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR

Jian You

Xiangfeng Li

147

18 Jul 2024

F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions

Xuesong Niu

228

17 Jul 2024

Lacuna Language Learning: Leveraging RNNs for Ranked Text Completion in Digitized Coptic Manuscripts

Lauren Levine

Cindy Tung Li

Lydia Bremer-McCollum

Nicholas Wagner

Amir Zeldes

RALM

144

17 Jul 2024

DANIEL: A fast Document Attention Network for Information Extraction and Labelling of handwritten documents

Thomas Constum

Pierrick Tranouez

Thierry Paquet

180

12 Jul 2024

Towards Chapter-to-Chapter Context-Aware Literary Translation via Large Language Models

Linghao Jin

Li An

Xuezhe Ma

297

12 Jul 2024

Mitigating Catastrophic Forgetting in Language Transfer via Model Merging

Ce Zhang

396

11 Jul 2024

Automata-based constraints for language model decoding

363

11 Jul 2024

Learning Program Behavioral Models from Synthesized Input-Output Pairs

275

11 Jul 2024

Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models

276

09 Jul 2024

Data, Data Everywhere: A Guide for Pretraining Dataset Construction

Zhilin Wang

294

08 Jul 2024

MST5 -- Multilingual Question Answering over Knowledge Graphs

154

08 Jul 2024

An Empirical Comparison of Vocabulary Expansion and Initialization Approaches for Language Models

Anoop Kunchukuttan

194

08 Jul 2024

Large Language Models Understand Layout

322

08 Jul 2024

Do Multilingual Large Language Models Mitigate Stereotype Bias?

334

08 Jul 2024

Statistical investigations into the geometry and homology of random programs

Jon Sporring

Ken Friis Larsen

05 Jul 2024

Toucan: Many-to-Many Translation for 150 African Language Pairs

AbdelRahim Elmadany

Ife Adebara

Muhammad Abdul-Mageed

256

05 Jul 2024

Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models

169

05 Jul 2024

TokenVerse: Unifying Speech and NLP Tasks via Transducer-based ASR

Juan Zuluaga-Gomez

220

05 Jul 2024

Serialized Output Training by Learned Dominance

Ying Shi

141

04 Jul 2024

BM25S: Orders of magnitude faster lexical search via eager sparse scoring

Xing Han Lù

209

04 Jul 2024

Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations

Kunal Dhawan

Nithin Rao Koluguri

Ante Jukić

Ryan Langman

Jagadeesh Balam

Boris Ginsburg

222

03 Jul 2024

Single Character Perturbations Break LLM Alignment

832

03 Jul 2024

Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data

212

03 Jul 2024

A Case Study on Context-Aware Neural Machine Translation with Multi-Task Learning

Ramakrishna Appicharla

Baban Gain

Santanu Pal

Asif Ekbal

Pushpak Bhattacharyya

204

03 Jul 2024

Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale

224

02 Jul 2024

How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation

Yan Meng

Di Wu

Christof Monz

345

02 Jul 2024