SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
19 August 2018
Taku Kudo
John Richardson
Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"
50 / 1,923 papers shown
Title
Towards Tailored Recovery of Lexical Diversity in Literary Machine Translation
Esther Ploeger
Huiyuan Lai
Rik van Noord
Antonio Toral
34
1
0
30 Aug 2024
Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions
Sully F. Chen
Robert J. Steele
Beakal Lemeneh
S. Lad
Eric K. Oermann
AI4CE
47
0
0
29 Aug 2024
Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough
Konstantin Dobler
Gerard de Melo
58
1
0
28 Aug 2024
Depth-Weighted Detection of Behaviours of Risk in People with Dementia using Cameras
Pratik K. Mishra
Irene Ballester
Andrea Iaboni
Bing Ye
Kristine Newman
Alex Mihailidis
Shehroz S. Khan
47
0
0
28 Aug 2024
Positional Description for Numerical Normalization
Deepanshu Gupta
Javier Latorre
3DGS
34
0
0
22 Aug 2024
Distributional Properties of Subword Regularization
Marco Cognetta
Vilém Zouhar
Naoaki Okazaki
43
0
0
21 Aug 2024
Plug, Play, and Fuse: Zero-Shot Joint Decoding via Word-Level Re-ranking Across Diverse Vocabularies
Sai Koneru
Matthias Huck
M. Exel
Jan Niehues
32
0
0
21 Aug 2024
Goldfish: Monolingual Language Models for 350 Languages
Tyler A. Chang
Catherine Arnett
Zhuowen Tu
Benjamin Bergen
LRM
51
4
0
19 Aug 2024
Language-Informed Beam Search Decoding for Multilingual Machine Translation
Yilin Yang
Stefan Lee
Prasad Tadepalli
38
1
0
11 Aug 2024
AcrosticSleuth: Probabilistic Identification and Ranking of Acrostics in Multilingual Corpora
Aleksandr Fedchin
Isabel Cooperman
Pramit Chaudhuri
Joseph P. Dexter
46
0
0
08 Aug 2024
Cooperative Multi-Agent Deep Reinforcement Learning in Content Ranking Optimization
Zhou Qin
Kai Yuan
Pratik Lahiri
Wenyang Liu
BDL
45
0
0
08 Aug 2024
Semantics or spelling? Probing contextual word embeddings with orthographic noise
Jacob A. Matthews
John R. Starr
Marten van Schijndel
40
2
0
08 Aug 2024
EgyBERT: A Large Language Model Pretrained on Egyptian Dialect Corpora
Faisal Qarah
40
3
0
07 Aug 2024
SETN: Stock Embedding Enhanced with Textual and Network Information
Takehiro Takayanagi
Hiroki Sakaji
Kiyoshi Izumi
AIFin
43
2
0
06 Aug 2024
Compromising Embodied Agents with Contextual Backdoor Attacks
Aishan Liu
Yuguang Zhou
Xianglong Liu
Tianyuan Zhang
Siyuan Liang
...
Tianlin Li
Junqi Zhang
Wenbo Zhou
Qing Guo
Dacheng Tao
LLMAG
AAML
47
9
0
06 Aug 2024
MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization
Yiwen Chen
Yikai Wang
Yihao Luo
Zhilin Wang
Zilong Chen
Jun Zhu
Chi Zhang
Guosheng Lin
33
24
0
05 Aug 2024
Batching BPE Tokenization Merges
Alexander P. Morgan
37
0
0
05 Aug 2024
SNFinLLM: Systematic and Nuanced Financial Domain Adaptation of Chinese Large Language Models
Shujuan Zhao
Lingfeng Qiao
Kangyang Luo
Qian-Wen Zhang
Junru Lu
Di Yin
AIFin
39
1
0
05 Aug 2024
Advancing Post-OCR Correction: A Comparative Study of Synthetic Data
Shuhao Guan
Derek Greene
36
6
0
05 Aug 2024
Improving Multilingual Neural Machine Translation by Utilizing Semantic and Linguistic Features
Mengyu Bu
Shuhao Gu
Yang Feng
36
3
0
02 Aug 2024
Leveraging Entailment Judgements in Cross-Lingual Summarisation
Huajian Zhang
Laura Perez-Beltrachini
HILM
44
0
0
01 Aug 2024
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team
Morgane Riviere
Shreya Pathak
Pier Giuseppe Sessa
Cassidy Hardin
...
Noah Fiedel
Armand Joulin
Kathleen Kenealy
Robert Dadashi
Alek Andreev
VLM
MoE
OSLM
42
681
0
31 Jul 2024
Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning
Omer Nacar
Anis Koubaa
26
2
0
30 Jul 2024
Sentiment Analysis of Lithuanian Online Reviews Using Large Language Models
Brigita Vileikytė
M. Lukoševičius
Lukas Stankevicius
36
1
0
29 Jul 2024
Granularity is crucial when applying differential privacy to text: An investigation for neural machine translation
Doan Nam Long Vu
Timour Igamberdiev
Ivan Habernal
55
0
0
26 Jul 2024
Unified Lexical Representation for Interpretable Visual-Language Alignment
Yifan Li
Yikai Wang
Yanwei Fu
Dongyu Ru
Zheng-Wei Zhang
Tong He
VLM
42
4
0
25 Jul 2024
Coupling Speech Encoders with Downstream Text Models
Ciprian Chelba
J. Schalkwyk
AuLLM
45
0
0
24 Jul 2024
Dependency Transformer Grammars: Integrating Dependency Structures into Transformer Language Models
Yida Zhao
Chao Lou
Kewei Tu
56
0
0
24 Jul 2024
The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization
Samuele Cornell
Taejin Park
Steve Huang
Christoph Boeddeker
Xuankai Chang
Matthew Maciejewski
Sanjeev Khudanpur
Paola García
Shinji Watanabe
41
9
0
23 Jul 2024
Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines
Yuchen Li
Alexandre Kirchmeyer
Aashay Mehta
Yilong Qin
Boris Dadachev
Kishore Papineni
Sanjiv Kumar
Andrej Risteski
61
0
0
22 Jul 2024
CoVoSwitch: Machine Translation of Synthetic Code-Switched Text Based on Intonation Units
Yeeun Kang
37
0
0
19 Jul 2024
Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training
Lukuan Dong
Donghong Qin
Fengbo Bai
Fanhua Song
Yan Liu
Chen Xu
Zhijian Ou
31
0
0
18 Jul 2024
A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR
Jian You
Xiangfeng Li
28
1
0
18 Jul 2024
F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
Jie Yang
Xuesong Niu
Nan Jiang
Ruimao Zhang
Siyuan Huang
35
9
0
17 Jul 2024
Lacuna Language Learning: Leveraging RNNs for Ranked Text Completion in Digitized Coptic Manuscripts
Lauren Levine
Cindy Tung Li
Lydia Bremer-McCollum
Nicholas Wagner
Amir Zeldes
RALM
31
1
0
17 Jul 2024
Beyond Next Token Prediction: Patch-Level Training for Large Language Models
Chenze Shao
Fandong Meng
Jie Zhou
53
1
0
17 Jul 2024
DANIEL: A fast Document Attention Network for Information Extraction and Labelling of handwritten documents
Thomas Constum
Pierrick Tranouez
Thierry Paquet
32
5
0
12 Jul 2024
Towards Chapter-to-Chapter Context-Aware Literary Translation via Large Language Models
Linghao Jin
Li An
Xuezhe Ma
34
0
0
12 Jul 2024
Mitigating Catastrophic Forgetting in Language Transfer via Model Merging
Anton Alexandrov
Veselin Raychev
Mark Niklas Muller
Ce Zhang
Martin Vechev
Kristina Toutanova
MoMe
CLL
KELM
42
15
0
11 Jul 2024
Automata-based constraints for language model decoding
Terry Koo
Frederick Liu
Luheng He
AI4CE
52
16
0
11 Jul 2024
Learning Program Behavioral Models from Synthesized Input-Output Pairs
Tural Mammadov
Dietrich Klakow
Alexander Koller
Andreas Zeller
45
3
0
11 Jul 2024
Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models
Jupinder Parmar
Sanjev Satheesh
M. Patwary
M. Shoeybi
Bryan Catanzaro
56
29
0
09 Jul 2024
Data, Data Everywhere: A Guide for Pretraining Dataset Construction
Jupinder Parmar
Shrimai Prabhumoye
Joseph Jennings
Bo Liu
Aastha Jhunjhunwala
Zhilin Wang
M. Patwary
M. Shoeybi
Bryan Catanzaro
53
6
0
08 Jul 2024
MST5 -- Multilingual Question Answering over Knowledge Graphs
Nikit Srivastava
Mengshi Ma
Daniel Vollmers
Hamada M. Zahera
Diego Moussallem
A. N. Ngomo
36
0
0
08 Jul 2024
An Empirical Comparison of Vocabulary Expansion and Initialization Approaches for Language Models
Nandini Mundra
Aditya Nanda Kishore
Raj Dabre
Ratish Puduppully
Anoop Kunchukuttan
Mitesh Khapra
30
4
0
08 Jul 2024
Large Language Models Understand Layout
Weiming Li
Manni Duan
Dong An
Yan Shao
54
3
0
08 Jul 2024
Do Multilingual Large Language Models Mitigate Stereotype Bias?
Shangrui Nie
Michael Fromm
Charles F Welch
Rebekka Görge
Akbar Karimi
Joan Plepi
Nazia Afsan Mowmita
Nicolas Flores-Herr
Mehdi Ali
Lucie Flek
40
4
0
08 Jul 2024
Statistical investigations into the geometry and homology of random programs
Jon Sporring
Ken Friis Larsen
23
0
0
05 Jul 2024
Toucan: Many-to-Many Translation for 150 African Language Pairs
AbdelRahim Elmadany
Ife Adebara
Muhammad Abdul-Mageed
39
1
0
05 Jul 2024
Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models
Bolaji Yusuf
M. Baskar
Andrew Rosenberg
Bhuvana Ramabhadran
45
1
0
05 Jul 2024