SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
arXiv:1808.06226

19 August 2018
Taku Kudo
John Richardson
ArXiv (abs) | PDF | HTML | GitHub (10925★)

Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

50 / 2,064 papers shown
Language Model Tokenizers Introduce Unfairness Between Languages
Neural Information Processing Systems (NeurIPS), 2023
Aleksandar Petrov
Emanuele La Malfa
Juil Sock
Adel Bibi
346
173
0
17 May 2023
Epsilon Sampling Rocks: Investigating Sampling Strategies for Minimum Bayes Risk Decoding for Machine Translation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Markus Freitag
Behrooz Ghorbani
Patrick Fernandes
212
57
0
17 May 2023
Sasha: Creative Goal-Oriented Reasoning in Smart Homes with Large Language Models
Proceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies (IMWUT), 2023
Evan King
Haoxiang Yu
Sangsu Lee
Christine Julien
LM&Ro
159
35
0
16 May 2023
AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation
Neural Information Processing Systems (NeurIPS), 2023
Tong Wu
Zhihao Fan
Xiao Liu
Yeyun Gong
Yelong Shen
...
Juntao Li
Zhongyu Wei
Jian Guo
Nan Duan
Weizhu Chen
VLM
391
116
0
16 May 2023
Towards Speech Dialogue Translation Mediating Speakers of Different Languages
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Shuichiro Shimizu
Chenhui Chu
Sheng Li
Sadao Kurohashi (Kyoto University)
157
3
0
16 May 2023
Beqi: Revitalize the Senegalese Wolof Language with a Robust Spelling Corrector
Derguene Mbaye
Moussa Diallo
111
4
0
15 May 2023
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
Neural Information Processing Systems (NeurIPS), 2023
L. Yu
Daniel Simig
Colin Flaherty
Armen Aghajanyan
Luke Zettlemoyer
M. Lewis
296
136
0
12 May 2023
Advancing Neural Encoding of Portuguese with Transformer Albertina PT-*
Portuguese Conference on Artificial Intelligence (EPIA), 2023
João Rodrigues
Luís Gomes
Joao Silva
António Branco
Rodrigo Santos
Henrique Lopes Cardoso
T. Osório
167
54
0
11 May 2023
What is the best recipe for character-level encoder-only modelling?
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Kris Cao
149
6
0
09 May 2023
Boosting Zero-shot Cross-lingual Retrieval by Training on Artificially Code-Switched Data
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Robert Litschko
Ekaterina Artemova
Barbara Plank
203
8
0
09 May 2023
Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Xuandi Fu
Kanthashree Mysore Sathyendra
Ankur Gandhe
Jing Liu
Grant P. Strimel
Ross McGowan
Athanasios Mouchtaris
296
21
0
09 May 2023
CharSpan: Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023
Kaushal Kumar Maurya
Rahul Kejriwal
M. Desarkar
Anoop Kunchukuttan
255
1
0
09 May 2023
Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
Automatic Speech Recognition & Understanding (ASRU), 2023
Dima Rekesh
Nithin Rao Koluguri
Samuel Kriman
Somshubra Majumdar
Vahid Noroozi
...
Oleksii Hrinchuk
Krishna Puvvada
Ankur Kumar
Jagadeesh Balam
Boris Ginsburg
330
144
0
08 May 2023
Leveraging Synthetic Targets for Machine Translation
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Sarthak Mittal
Oleksii Hrinchuk
Oleksii Kuchaiev
147
2
0
07 May 2023
Two to Five Truths in Non-Negative Matrix Factorization
International Workshop on Complex Networks & Their Applications (CNTA), 2023
John M. Conroy
Neil P. Molino
Brian Baughman
Rod Gomez
Ryan Kaliszewski
Nicholas A. Lines
215
0
0
06 May 2023
Pre-training Language Model as a Multi-perspective Course Learner
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Beiduo Chen
Shaohan Huang
Zi-qiang Zhang
Wu Guo
Zhen-Hua Ling
Haizhen Huang
Furu Wei
Weiwei Deng
Tao Gui
204
1
0
06 May 2023
Now It Sounds Like You: Learning Personalized Vocabulary On Device
AAAI Spring Symposia (ASS), 2023
Sida Wang
Ashish Shenoy
P. Chuang
John Nguyen
VLM
320
5
0
05 May 2023
Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages
European Association for Machine Translation Conferences/Workshops (EAMT), 2023
Sonal Sannigrahi
Rachel Bawden
140
0
0
04 May 2023
Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Yun Tang
Anna Y. Sun
Hirofumi Inaguma
Xinyue Chen
Ning Dong
Xutai Ma
Paden Tomasello
J. Pino
262
27
0
04 May 2023
What changes when you randomly choose BPE merge operations? Not much
First Workshop on Insights from Negative Results in NLP (Insights), 2023
Jonne Saleva
Constantine Lignos
146
11
0
04 May 2023
Learning Language-Specific Layers for Multilingual Machine Translation
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Telmo Pires
Robin M. Schmidt
Yi-Hsiu Liao
Stephan Peitz
256
22
0
04 May 2023
Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Da Xu
Maha Elbayad
Kenton W. Murray
Jean Maillard
Vedanuj Goswami
MoE
229
5
0
03 May 2023
Low-Resourced Machine Translation for Senegalese Wolof Language
Derguene Mbaye
Moussa Diallo
T. Diop
163
5
0
01 May 2023
ResiDual: Transformer with Dual Residual Connections
Shufang Xie
Huishuai Zhang
Junliang Guo
Xu Tan
Jiang Bian
Hany Awadalla
Arul Menezes
Tao Qin
Rui Yan
168
26
0
28 Apr 2023
Training and Evaluation of a Multilingual Tokenizer for GPT-SW3
Felix Stollenwerk
210
10
0
28 Apr 2023
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye
Haiyang Xu
Guohai Xu
Jiabo Ye
Ming Yan
...
Junfeng Tian
Qiang Qi
Ji Zhang
Feiyan Huang
Jingren Zhou
VLM
MLLM
1.1K
1,164
0
27 Apr 2023
Semantic Tokenizer for Enhanced Natural Language Processing
Sandeep Mehta
Darpan Shah
Ravindra Kulkarni
Cornelia Caragea
VLM
33
4
0
24 Apr 2023
NAIST-SIC-Aligned: an Aligned English-Japanese Simultaneous Interpretation Corpus
Jinming Zhao
Yuka Ko
Kosuke Doi
Ryo Fukuda
Katsuhito Sudoh
Satoshi Nakamura
348
2
0
23 Apr 2023
Tokenization Preference for Human and Machine Learning Model: An Annotation Study
Tatsuya Hiraoka
Tomoya Iwakura
169
1
0
21 Apr 2023
Downstream Task-Oriented Neural Tokenizer Optimization with Vocabulary Restriction as Post Processing
Tatsuya Hiraoka
Tomoya Iwakura
119
0
0
21 Apr 2023
Joint Repetition Suppression and Content Moderation of Large Language Models
Minghui Zhang
Alex Sokolov
Weixin Cai
Si-Qing Chen
227
2
0
20 Apr 2023
MPMQA: Multimodal Question Answering on Product Manuals
AAAI Conference on Artificial Intelligence (AAAI), 2023
Liangfu Zhang
Anwen Hu
Jing Zhang
Shuo Hu
Qin Jin
198
14
0
19 Apr 2023
UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining
International Conference on Learning Representations (ICLR), 2023
Hyung Won Chung
Noah Constant
Xavier Garcia
Adam Roberts
Yi Tay
Sharan Narang
Orhan Firat
282
101
0
18 Apr 2023
From Words to Music: A Study of Subword Tokenization Techniques in Symbolic Music Generation
Adarsh Kumar
Pedro Sarmento
191
4
0
18 Apr 2023
Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese
Nordic Conference of Computational Linguistics (NODALIDA), 2023
Vésteinn Snaebjarnarson
A. Simonsen
Goran Glavaš
Ivan Vulić
252
30
0
18 Apr 2023
A Survey for Biomedical Text Summarization: From Pre-trained to Large Language Models
Qianqian Xie
Zheheng Luo
Benyou Wang
Sophia Ananiadou
LM&MA
VLM
211
14
0
18 Apr 2023
The MiniPile Challenge for Data-Efficient Language Models
Jean Kaddour
MoE
ALM
320
63
0
17 Apr 2023
VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning
Zhen-Ru Zhang
Chuanqi Tan
Songfang Huang
Fei Huang
VLM
156
6
0
17 Apr 2023
Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
Yiming Cui
Ziqing Yang
Xin Yao
ALM
291
391
0
17 Apr 2023
Neural Machine Translation For Low Resource Languages
V. Goyle
Parvathy Krishnaswamy
K. G. Ravikumar
Utsa Chattopadhyay
Kartikay Goyle
77
0
0
16 Apr 2023
Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation
Yunjie Ji
Yan Gong
Yong Deng
Yiping Peng
Qiang Niu
Baochang Ma
Xiangang Li
ALM
ELM
240
28
0
16 Apr 2023
A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition
IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2023
Ruchao Fan
Wei Chu
Peng Chang
Abeer Alwan
173
18
0
15 Apr 2023
Sign Language Translation from Instructional Videos
Laia Tarrés
Gerard I. Gállego
A. Duarte
Jordi Torres
Xavier Giró-i-Nieto
SLR
239
54
0
13 Apr 2023
Computational modeling of semantic change
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023
Nina Tahmasebi
Haim Dubossarsky
293
7
0
13 Apr 2023
Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis
Wenhao Zhu
Hongyi Liu
Qingxiu Dong
Jingjing Xu
Shujian Huang
Lingpeng Kong
Jiajun Chen
Lei Li
LRM
369
226
0
10 Apr 2023
PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for Translation with Semi-Supervised Pseudo-Parallel Document Generation
Alireza Salemi
Amirhossein Abaskohi
Sara Tavakoli
Yadollah Yaghoobzadeh
A. Shakery
AIMat
227
0
0
03 Apr 2023
DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains
bioRxiv, 2023
Yanis Labrak
Adrien Bazoge
Richard Dufour
Mickael Rouvier
Emmanuel Morin
B. Daille
P. Gourraud
LM&MA
315
62
0
03 Apr 2023
GreekBART: The First Pretrained Greek Sequence-to-Sequence Model
International Conference on Language Resources and Evaluation (LREC), 2023
Iakovos Evdaimon
Hadi Abdine
Christos Xypolopoulos
Stamatis Outsios
Michalis Vazirgiannis
Giorgos Stamou
VLM
112
9
0
03 Apr 2023
Spam-T5: Benchmarking Large Language Models for Few-Shot Email Spam Detection
Maxime Labonne
Sean J. Moran
300
35
0
03 Apr 2023
Exploiting Multilingualism in Low-resource Neural Machine Translation via Adversarial Learning
Amit Kumar
A. Pratap
Anil Kumar Singh
AI4CE
186
2
0
31 Mar 2023