SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

19 August 2018

Taku Kudo

John Richardson

Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

50 / 2,063 papers shown

Self-Attention Mechanism in Multimodal Context for Banking Transaction Flow

Cyrile Delestre

Yoann Sola

10 Oct 2024

Transducer Consistency Regularization for Speech to Text Applications

Spoken Language Technology Workshop (SLT), 2024

Cindy Tseng

Yun Tang

Vijendra Raj Apsingekar

289

09 Oct 2024

Generative Model for Less-Resourced Language with 1 billion parameters

171

09 Oct 2024

Inference over Unseen Entities, Relations and Literals on Knowledge Graphs

Caglar Demir

N'Dah Jean Kouagou

Arnab Sharma

Axel-Cyrille Ngonga Ngomo

177

09 Oct 2024

DEPT: Decoupled Embeddings for Pre-training Language Models

International Conference on Learning Representations (ICLR), 2024

William F. Shen

Dongqi Cai

Nicholas D. Lane

1.4K

07 Oct 2024

Language Model-Driven Data Pruning Enables Efficient Active Learning

283

05 Oct 2024

Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

137

04 Oct 2024

Cross-lingual Transfer for Automatic Question Generation by Learning Interrogative Structures in Target Languages

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Seonjeong Hwang

Yunsu Kim

Gary Geunbae Lee

214

04 Oct 2024

MELODI: Exploring Memory Compression for Long Contexts

International Conference on Learning Representations (ICLR), 2024

194

04 Oct 2024

No Need to Talk: Asynchronous Mixture of Language Models

International Conference on Learning Representations (ICLR), 2024

Anastasiia Filippova

Angelos Katharopoulos

David Grangier

Ronan Collobert

MoE

369

04 Oct 2024

Morphological evaluation of subwords vocabulary used by BETO language model

Óscar García-Sierra

Ana Fernández-Pampillón Cesteros

Miguel Ortega-Martín

216

03 Oct 2024

Selective Attention Improves Transformer

International Conference on Learning Representations (ICLR), 2024

Yaniv Leviathan

Matan Kalman

Yossi Matias

346

03 Oct 2024

HAINAN: Fast and Accurate Transducer for Hybrid-Autoregressive ASR

International Conference on Learning Representations (ICLR), 2024

960

03 Oct 2024

Gold Panning in Vocabulary: An Adaptive Method for Vocabulary Expansion of Domain-Specific LLMs

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Kun Kuang

Changlong Sun

Fei Wu

160

02 Oct 2024

FedPT: Federated Proxy-Tuning of Large Language Models on Resource-Constrained Edge Devices

Zhidong Gao

Yu Zhang

Zhenxiao Zhang

Yanmin Gong

Yuanxiong Guo

155

01 Oct 2024

Enhancing High-order Interaction Awareness in LLM-based Recommender Model

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

241

30 Sep 2024

Universal Medical Image Representation Learning with Compositional Decoders

Kaini Wang

Ling Yang

Siping Zhou

Guangquan Zhou

Wentao Zhang

Bin Cui

Shuo Li

SSL MedIm

288

30 Sep 2024

AfriHuBERT: A self-supervised speech representation model for African languages

Jesujoba Oluwadara Alabi

437

30 Sep 2024

Exploring Language Model Generalization in Low-Resource Extractive QA

International Conference on Computational Linguistics (COLING), 2024

Suhang Wang

285

27 Sep 2024

Convolutional Signal Propagation: A Simple Scalable Algorithm for Hypergraphs

206

26 Sep 2024

LangSAMP: Language-Script Aware Multilingual Pretraining

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

511

26 Sep 2024

How Transliterations Improve Crosslingual Alignment

International Conference on Computational Linguistics (COLING), 2024

Yihong Liu

Mingyang Wang

Amir Hossein Kargaran

Ayyoob Imani

Hinrich Schütze

206

25 Sep 2024

EuroLLM: Multilingual Language Models for Europe

Pedro Henrique Martins

Patrick Fernandes

...

Alexandra Birch

André F. T. Martins

228

24 Sep 2024

Multilingual Transfer and Domain Adaptation for Low-Resource Languages of Spain

Conference on Machine Translation (WMT), 2024

Zongyao Li

...

Shaojun Li

Jinlong Yang

Yuhao Xie

Jiawei Zheng

Bin Wei

Hao Yang

112

24 Sep 2024

Machine Translation Advancements of Low-Resource Indian Languages by Transfer Learning

Conference on Machine Translation (WMT), 2024

Bin Wei

Zongyao Li

...

Jinlong Yang

Yuhao Xie

Hao Yang

VLM

127

24 Sep 2024

dnaGrinder: a lightweight and high-capacity genomic foundation model

Qihang Zhao

Chi Zhang

Weixiong Zhang

172

24 Sep 2024

HW-TSC's Submission to the CCMT 2024 Machine Translation Tasks

Zhanglin Wu

Yuanchang Luo

Daimeng Wei

Jiawei Zheng

Bin Wei

...

Jiaxin Guo

Shaojun Li

Mengli Zhu

Ning Xie

Hao Yang

206

23 Sep 2024

Choose the Final Translation from NMT and LLM hypotheses Using MBR Decoding: HW-TSC's Submission to the WMT24 General MT Shared Task

Conference on Machine Translation (WMT), 2024

Zhanglin Wu

Daimeng Wei

Zongyao Li

Hengchao Shang

Jiaxin Guo

Shaojun Li

Zhiqiang Rao

Yuanchang Luo

Ning Xie

Hao Yang

187

23 Sep 2024

Cross-Domain Content Generation with Domain-Specific Small Language Models

Ankit Maloo

Abhinav Garg

CLL

214

19 Sep 2024

An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems

International Conference on Machine Learning (ICML), 2024

175

16 Sep 2024

PixelBytes: Catching Unified Representation for Multimodal Generation

Fabien Furfaro

123

16 Sep 2024

DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification

Abdelkader El Mahdaouy

Salima Lamsiyah

Meryem Janati Idrissi

H. Alami

Zakaria Yartaoui

Ismail Berrada

142

13 Sep 2024

Optimizing Rare Word Accuracy in Direct Speech Translation with a Retrieval-and-Demonstration Approach

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Siqi Li

Danni Liu

Jan Niehues

270

13 Sep 2024

Retro-li: Small-Scale Retrieval Augmented Generation Supporting Noisy Similarity Searches and Domain Shift Generalization

European Conference on Artificial Intelligence (ECAI), 2024

Abu Sebastian

533

12 Sep 2024

TeXBLEU: Automatic Metric for Evaluate LaTeX Format

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

314

10 Sep 2024

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

236

06 Sep 2024

Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak

Conference on Machine Translation (WMT), 2024

Mukhammadsaid Mamasaidov

Abror Shopulatov

VLM

109

06 Sep 2024

The AdEMAMix Optimizer: Better, Faster, Older

International Conference on Learning Representations (ICLR), 2024

Matteo Pagliardini

Pierre Ablin

David Grangier

ODL

322

05 Sep 2024

Multi-modal Situated Reasoning in 3D Scenes

Neural Information Processing Systems (NeurIPS), 2024

Baoxiong Jia

Siyuan Huang

358

04 Sep 2024

Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR

Spoken Language Technology Workshop (SLT), 2024

Weiqing Wang

Kunal Dhawan

Taejin Park

Jagadeesh Balam

Boris Ginsburg

226

02 Sep 2024

Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts

International Conference on Computational Linguistics (COLING), 2024

Yingfa Chen

Chenlong Hu

Cong Feng

Chenyang Song

Shi Yu

Xu Han

Zhiyuan Liu

Maosong Sun

155

02 Sep 2024

Towards Tailored Recovery of Lexical Diversity in Literary Machine Translation

European Association for Machine Translation Conferences/Workshops (EAMT), 2024

183

30 Aug 2024

Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions

344

29 Aug 2024

Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough

Konstantin Dobler

Gerard de Melo

204

28 Aug 2024

Depth-Weighted Detection of Behaviours of Risk in People with Dementia using Cameras

247

28 Aug 2024

Positional Description for Numerical Normalization

Interspeech (Interspeech), 2024

Deepanshu Gupta

Javier Latorre

3DGS

159

22 Aug 2024

Distributional Properties of Subword Regularization

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Marco Cognetta

Vilém Zouhar

Naoaki Okazaki

176

21 Aug 2024

Plug, Play, and Fuse: Zero-Shot Joint Decoding via Word-Level Re-ranking Across Diverse Vocabularies

Conference on Machine Translation (WMT), 2024

Sai Koneru

Matthias Huck

M. Exel

Jan Niehues

191

21 Aug 2024

Goldfish: Monolingual Language Models for 350 Languages

Zhuowen Tu

271

19 Aug 2024

Language-Informed Beam Search Decoding for Multilingual Machine Translation

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Yilin Yang

Stefan Lee

Prasad Tadepalli

166

11 Aug 2024