SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

19 August 2018

Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

50 / 2,063 papers shown

The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion

Computer Vision and Pattern Recognition (CVPR), 2024

356

13 Dec 2024

Efficient Continual Pre-training of LLMs for Low-resource Languages

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

287

13 Dec 2024

Multi-Head Encoding for Extreme Label Classification

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

259

13 Dec 2024

PolyIPA -- Multilingual Phoneme-to-Grapheme Conversion Model

Davor Lauc

220

12 Dec 2024

Scaling Sequential Recommendation Models with Transformers

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024

Pablo Zivic

Hernán Ceferino Vázquez

Jorge Sanchez

OffRL LRM

291

10 Dec 2024

Representation Purification for End-to-End Speech Translation

International Conference on Computational Linguistics (COLING), 2024

182

05 Dec 2024

From Language Models over Tokens to Language Models over Characters

447

04 Dec 2024

Improving Language Transfer Capability of Decoder-only Architecture in Multilingual Neural Machine Translation

136

03 Dec 2024

Yi-Lightning Technical Report

...

Zonghong Dai

708

02 Dec 2024

A Wave is Worth 100 Words: Investigating Cross-Domain Transferability in Time Series

272

01 Dec 2024

ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain

Ali Shiraee Kasmaee

Mohammad Khodadad

Mohammad Arshi Saloot

1.3K

30 Nov 2024

Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods

IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2024

Burak Suyunu

Enes Taylan

Arzucan Özgür

234

26 Nov 2024

Visual-Word Tokenizer: Beyond Fixed Sets of Tokens in Vision Transformers

681

23 Nov 2024

Context-Aware Multimodal Pretraining

Computer Vision and Pattern Recognition (CVPR), 2024

353

22 Nov 2024

Why do language models perform worse for morphologically complex languages?

Catherine Arnett

Benjamin Bergen

230

21 Nov 2024

The Master-Slave Encoder Model for Improving Patent Text Summarization: A New Approach to Combining Specifications and Claims

272

21 Nov 2024

Watermark under Fire: A Robustness Evaluation of LLM Watermarking

545

20 Nov 2024

Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation

Tim Elsner

Paula Usinger

Julius Nehring-Wirxel

255

15 Nov 2024

Xmodel-1.5: An 1B-scale Multilingual LLM

351

15 Nov 2024

Fine-tuning Large Language Models with Limited Data: A Survey and Practical Guide

Márton Szép

Daniel Rueckert

Rüdiger von Eisenhart-Rothe

Florian Hinterwimmer

SyDa ALM

576

14 Nov 2024

Mamba-based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition

Spoken Language Technology Workshop (SLT), 2024

264

11 Nov 2024

When are 1.58 bits enough? A Bottom-up Exploration of BitNet Quantization

Jacob Nielsen

Lukas Galke

Peter Schneider-Kamp

228

08 Nov 2024

Fine-Grained Reward Optimization for Machine Translation using Error Severity Mappings

471

08 Nov 2024

Deploying Multi-task Online Server with Large Language Model

International Conference on Computational Linguistics (COLING), 2024

244

06 Nov 2024

Classification Done Right for Vision-Language Pre-Training

Neural Information Processing Systems (NeurIPS), 2024

419

05 Nov 2024

Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs

Neural Information Processing Systems (NeurIPS), 2024

339

04 Nov 2024

SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation

585

03 Nov 2024

MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

Langlin Huang

Mengyu Bu

Yang Feng

255

03 Nov 2024

Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval

IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2024

478

01 Nov 2024

LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models

604

01 Nov 2024

MrT5: Dynamic Token Merging for Efficient Byte-level Language Models

International Conference on Learning Representations (ICLR), 2024

Julie Kallini

Shikhar Murty

Christopher D. Manning

Christopher Potts

Róbert Csordás

416

28 Oct 2024

From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages

...

212

24 Oct 2024

A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs

A. S. Rawat

Veeranjaneyulu Sadhanala

...

Sanjiv Kumar

465

24 Oct 2024

Bielik 7B v0.1: A Polish Language Model -- Development, Insights, and Evaluation

186

24 Oct 2024

Scalable Influence and Fact Tracing for Large Language Model Pretraining

International Conference on Learning Representations (ICLR), 2024

307

22 Oct 2024

PLDR-LLM: Large Language Model from Power Law Decoder Representations

Burc Gokden

140

22 Oct 2024

Methods of improving LLM training stability

211

22 Oct 2024

Action abstractions for amortized sampling

International Conference on Learning Representations (ICLR), 2024

Moksh Jain

Nikolay Malkin

Emmanuel Bengio

Rim Assouel

Yoshua Bengio

194

19 Oct 2024

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

International Conference on Learning Representations (ICLR), 2024

Yuanzhen Li

Michael Rubinstein

325

110

17 Oct 2024

MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations

Yichao Yan

Xiaokang Yang

252

17 Oct 2024

Nominal Class Assignment in Swahili: A Computational Account

Giada Palmieri

Konstantinos Kogkalidis

16 Oct 2024

Interpreting token compositionality in LLMs: A robustness analysis

Nura Aljaafari

Danilo S. Carvalho

André Freitas

433

16 Oct 2024

Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5

Thao Anh Dang

Limor Raviv

Lukas Galke

310

15 Oct 2024

LargePiG: Your Large Language Model is Secretly a Pointer Generator

227

15 Oct 2024

Transfer Learning with Foundational Models for Time Series Forecasting using Low-Rank Adaptations

Information Fusion (Inf. Fusion), 2024

730

15 Oct 2024

ChakmaNMT: Machine Translation for a Low-Resource and Endangered Language via Transliteration

104

14 Oct 2024

Language Model Embeddings Can Be Sufficient for Bayesian Optimization

357

14 Oct 2024

Text Classification using Graph Convolutional Networks: A Comprehensive Survey

ACM Computing Surveys (ACM CSUR), 2024

Syed Mustafa Haider Rizvi

Ramsha Imran

Arif Mahmood

GNN OOD FaML

197

12 Oct 2024

Adapters for Altering LLM Vocabularies: What Languages Benefit the Most?

International Conference on Learning Representations (ICLR), 2024

HyoJung Han

238

12 Oct 2024

OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling

Neural Information Processing Systems (NeurIPS), 2024

Fang Peng

434

10 Oct 2024