SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

19 August 2018
Taku Kudo
John Richardson

Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

50 / 2,064 papers shown
MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling
Tomasz Limisiewicz
Terra Blevins
Hila Gonen
Orevaoghene Ahia
Luke Zettlemoyer
305
30
0
15 Mar 2024
DiPaCo: Distributed Path Composition
Arthur Douillard
Qixuang Feng
Andrei A. Rusu
A. Kuncoro
Yani Donchev
Rachita Chhaparia
Ionel Gog
Marc'Aurelio Ranzato
Jiajun Shen
Arthur Szlam
MoE
235
6
0
15 Mar 2024
Frozen Feature Augmentation for Few-Shot Image Classification
Andreas Bär
N. Houlsby
Mostafa Dehghani
Manoj Kumar
VLM
285
16
0
15 Mar 2024
Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Maxime Burchi
Krishna C. Puvvada
Jagadeesh Balam
Boris Ginsburg
Radu Timofte
216
17
0
14 Mar 2024
Token Alignment via Character Matching for Subword Completion
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Ben Athiwaratkun
Shiqi Wang
Mingyue Shang
Yuchen Tian
Zijian Wang
Sujan Kumar Gonugondla
Sanjay Krishna Gouda
Rob Kwiatowski
Ramesh Nallapati
Bing Xiang
192
9
0
13 Mar 2024
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team
Thomas Mesnard
Cassidy Hardin
Robert Dadashi
Surya Bhupatiraju
...
Armand Joulin
Noah Fiedel
Evan Senter
Alek Andreev
Kathleen Kenealy
VLM LLMAG
597
841
0
13 Mar 2024
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
Computer Vision and Pattern Recognition (CVPR), 2024
Lei Zhu
Fangyun Wei
Yanye Lu
MLLM VLM
222
30
0
12 Mar 2024
Masked AutoDecoder is Effective Multi-Task Vision Generalist
Computer Vision and Pattern Recognition (CVPR), 2024
Han Qiu
Jiaxing Huang
Shiyang Feng
Lewei Lu
Xiaoqin Zhang
Shijian Lu
213
5
0
12 Mar 2024
MAMMOTH: Massively Multilingual Modular Open Translation @ Helsinki
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2024
Timothee Mickus
Stig-Arne Gronroos
Joseph Attieh
M. Boggia
Ona de Gibert
Shaoxiong Ji
Niki Andreas Lopi
Alessandro Raganato
Ananda Sreenidhi
Jörg Tiedemann
220
3
0
12 Mar 2024
Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications
The Speaker and Language Recognition Workshop (Odyssey), 2024
Can Cui
Imran Ahmad Sheikh
Mostafa Sadeghi
Emmanuel Vincent
308
3
0
11 Mar 2024
Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages
Michael Andersland
94
0
0
11 Mar 2024
Authorship Attribution in Bangla Literature (AABL) via Transfer Learning using ULMFiT
ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 2022
Aisha Khatun
Anisur Rahman
Md. Saiful Islam
Hemayet Ahmed Chowdhury
A. Tasnim
177
4
0
08 Mar 2024
To Err Is Human, but Llamas Can Learn It Too
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Agnes Luhtaru
Taido Purason
Martin Vainikko
Maksym Del
Mark Fishel
SyDa ALM
230
4
0
08 Mar 2024
FFSTC: Fongbe to French Speech Translation Corpus
International Conference on Language Resources and Evaluation (LREC), 2024
D. F. Kponou
F. Laleye
E. C. Ezin
192
2
0
08 Mar 2024
Cross-lingual Transfer or Machine Translation? On Data Augmentation for Monolingual Semantic Textual Similarity
International Conference on Language Resources and Evaluation (LREC), 2024
Shochro Hoshino
Akihiko Kato
Soichiro Murakami
Peinan Zhang
164
1
0
08 Mar 2024
Yi: Open Foundation Models by 01.AI
01.AI
Alex Young
Bei Chen
Chao Li
...
Yue Wang
Yuxuan Cai
Zhenyu Gu
Zhiyuan Liu
Zonghong Dai
OSLM LRM
833
766
0
07 Mar 2024
CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?
International Conference on Learning Representations (ICLR), 2024
Ibrahim Alabdulmohsin
Xiao Wang
Andreas Steiner
Priya Goyal
Alexander D'Amour
Xiao-Qi Zhai
211
30
0
07 Mar 2024
gaHealth: An English-Irish Bilingual Corpus of Health Data
Séamus Lankford
Haithem Afli
Orla Ni Loinsigh
Andy Way
251
12
0
06 Mar 2024
BiVert: Bidirectional Vocabulary Evaluation using Relations for Machine Translation
Carinne Cherf
Yuval Pinter
88
1
0
06 Mar 2024
Towards Training A Chinese Large Language Model for Anesthesiology
Zhonghai Wang
Jie Jiang
Yibing Zhan
Bohao Zhou
Yanhong Li
...
Liang Ding
Hua Jin
Jun Peng
Xu Lin
Weifeng Liu
LM&MA
176
4
0
05 Mar 2024
adaptMLLM: Fine-Tuning Multilingual Language Models on Low-Resource Languages with Integrated LLM Playgrounds
Séamus Lankford
Haithem Afli
Andy Way
202
37
0
04 Mar 2024
A Generative Approach for Wikipedia-Scale Visual Entity Recognition
Mathilde Caron
Ahmet Iscen
Alireza Fathi
Cordelia Schmid
353
7
0
04 Mar 2024
Transformers for Low-Resource Languages: Is Féidir Linn!
Séamus Lankford
H. Alfi
Tamás Sarlós
276
25
0
04 Mar 2024
adaptNMT: an open-source, language-agnostic development environment for Neural Machine Translation
Séamus Lankford
Haithem Afli
Andy Way
247
4
0
04 Mar 2024
Human Evaluation of English--Irish Transformer-Based NMT
Séamus Lankford
Haithem Afli
Andy Way
232
14
0
04 Mar 2024
Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models
Amal Rannen-Triki
J. Bornschein
Razvan Pascanu
Marcus Hutter
Andras Gyorgy
Alexandre Galashov
Yee Whye Teh
Michalis K. Titsias
KELM
144
4
0
03 Mar 2024
Align-to-Distill: Trainable Attention Alignment for Knowledge Distillation in Neural Machine Translation
Heegon Jin
Seonil Son
Jemin Park
Youngseok Kim
Hyungjong Noh
Yeonsoo Lee
334
4
0
03 Mar 2024
VNLP: Turkish NLP Package
Meliksah Turker
Mehmet Erdi Ari
Aydin Han
159
3
0
02 Mar 2024
VBART: The Turkish LLM
Meliksah Turker
Mehmet Erdi Ari
Aydin Han
VLM
179
7
0
02 Mar 2024
Machine Translation in the Covid domain: an English-Irish case study for LoResMT 2021
Séamus Lankford
Haithem Afli
Andy Way
192
13
0
02 Mar 2024
Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models
Jinbiao Yang
LLMAG
260
13
0
01 Mar 2024
Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview
Heyang Liu
Yu Wang
Yanfeng Wang
278
0
0
01 Mar 2024
Compact Speech Translation Models via Discrete Speech Units Pretraining
Tsz Kin Lam
Alexandra Birch
Barry Haddow
352
3
0
29 Feb 2024
Robust Guidance for Unsupervised Data Selection: Capturing Perplexing Named Entities for Domain-Specific Machine Translation
Seunghyun Ji
Steve Andreas Immanuel
Darongsae Kwon
355
1
0
29 Feb 2024
Beyond Language Models: Byte Models are Digital World Simulators
Shangda Wu
Xu Tan
Zili Wang
Rui Wang
Xiaobing Li
Maosong Sun
139
22
0
29 Feb 2024
Advancing Generative AI for Portuguese with Open Decoder Gervásio PT*
Rodrigo Santos
Joao Silva
Luís Gomes
João Rodrigues
António Branco
213
17
0
29 Feb 2024
Tokenization Is More Than Compression
Craig W. Schmidt
Varshini Reddy
Haoran Zhang
Alec Alameddine
Omri Uzan
Yuval Pinter
Chris Tanner
357
65
0
28 Feb 2024
A Language Model based Framework for New Concept Placement in Ontologies
Hang Dong
Jiaoyan Chen
Yuan He
Yongsheng Gao
Ian Horrocks
243
9
0
27 Feb 2024
BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning
Qizhi Pei
Lijun Wu
Ran Bi
Xiaozhuan Liang
Yin Fang
Jinhua Zhu
Shufang Xie
Tao Qin
Rui Yan
AI4CE
411
56
0
27 Feb 2024
Nemotron-4 15B Technical Report
Jupinder Parmar
Shrimai Prabhumoye
Pritam Gundecha
M. Patwary
Sandeep Subramanian
...
Ashwath Aithal
Oleksii Kuchaiev
Mohammad Shoeybi
Jonathan Cohen
Bryan Catanzaro
225
30
0
26 Feb 2024
Generative AI in Vision: A Survey on Models, Metrics and Applications
Gaurav Raut
Apoorv Singh
VLM MedIm
223
13
0
26 Feb 2024
Quantum Transformer: Accelerating model inference via quantum linear algebra
Naixu Guo
Zhan Yu
Matthew Choi
Yizhan Han
Aman Agrawal
Kouhei Nakaji
Alán Aspuru-Guzik
Patrick Rebentrost
AI4CE
382
21
0
26 Feb 2024
Pfeed: Generating near real-time personalized feeds using precomputed embedding similarities
B. Gebre
Karoliina Ranta
S. V. D. Elzen
Ernst Kuiper
Thijs Baars
Tom Heskes
222
1
0
25 Feb 2024
ArabianGPT: Native Arabic GPT-based Large Language Model
Anis Koubaa
Adel Ammar
L. Ghouti
Omar Najar
Serry Sibaee
LM&MA
204
9
0
23 Feb 2024
Representing Online Handwriting for Recognition in Large Vision-Language Models
Anastasiia Fadeeva
Philippe Schlattner
Andrii Maksai
Mark Collier
Efi Kokiopoulou
Jesse Berent
C. Musat
288
7
0
23 Feb 2024
Fine-tuning Large Language Models for Domain-specific Machine Translation
Jiawei Zheng
Hanghai Hong
Xiaoli Wang
Jingsong Su
Yonggui Liang
Shikai Wu
ALM
192
63
0
23 Feb 2024
How Important Is Tokenization in French Medical Masked Language Models?
Yanis Labrak
Adrien Bazoge
B. Daille
Mickael Rouvier
Richard Dufour
214
1
0
22 Feb 2024
The Impact of Word Splitting on the Semantic Content of Contextualized Word Representations
Aina Garí Soler
Matthieu Labeau
Chloé Clavel
VLM
224
5
0
22 Feb 2024
OmniPred: Language Models as Universal Regressors
Xingyou Song
Oscar Li
Chansoo Lee
Bangding Yang
Daiyi Peng
Sagi Perel
Yutian Chen
444
34
0
22 Feb 2024
Subobject-level Image Tokenization
Delong Chen
Samuel Cahyawijaya
Jianfeng Liu
Baoyuan Wang
Pascale Fung
VLM OCL
629
19
0
22 Feb 2024