1808.06226
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
19 August 2018
Taku Kudo
John Richardson
ArXiv (abs)
PDF
HTML
Github (10925★)
Papers citing
"SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"
50 / 2,063 papers shown
AcrosticSleuth: Probabilistic Identification and Ranking of Acrostics in Multilingual Corpora
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Aleksandr Fedchin
Isabel Cooperman
Pramit Chaudhuri
Joseph P. Dexter
189
0
0
08 Aug 2024
EMTeC: A Corpus of Eye Movements on Machine-Generated Texts
Behavior Research Methods (BRM), 2024
Lena S. Bolliger
Patrick Haller
Isabelle Caroline Rose Cretton
D. R. Reich
Tannon Kew
Lena Ann Jäger
213
5
0
08 Aug 2024
Cooperative Multi-Agent Deep Reinforcement Learning in Content Ranking Optimization
Zhou Qin
Kai Yuan
Pratik Lahiri
Wenyang Liu
BDL
335
1
0
08 Aug 2024
Semantics or spelling? Probing contextual word embeddings with orthographic noise
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Jacob A. Matthews
John R. Starr
Marten van Schijndel
272
5
0
08 Aug 2024
EgyBERT: A Large Language Model Pretrained on Egyptian Dialect Corpora
Faisal Qarah
240
9
0
07 Aug 2024
SETN: Stock Embedding Enhanced with Textual and Network Information
Takehiro Takayanagi
Hiroki Sakaji
Kiyoshi Izumi
AIFin
288
2
0
06 Aug 2024
Compromising Embodied Agents with Contextual Backdoor Attacks
Aishan Liu
Yuguang Zhou
Xianglong Liu
Tianyuan Zhang
Yaning Tan
...
Tianlin Li
Junqi Zhang
Wenbo Zhou
Qing Guo
Dacheng Tao
LLMAG
AAML
250
26
0
06 Aug 2024
MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization
Yiwen Chen
Yikai Wang
Yihao Luo
Liang Luo
Zilong Chen
Jun Zhu
Chi Zhang
Guosheng Lin
237
65
0
05 Aug 2024
Batching BPE Tokenization Merges
Alexander P. Morgan
154
0
0
05 Aug 2024
SNFinLLM: Systematic and Nuanced Financial Domain Adaptation of Chinese Large Language Models
Shujuan Zhao
Lingfeng Qiao
Kangyang Luo
Qian-Wen Zhang
Junru Lu
Di Yin
AIFin
232
4
0
05 Aug 2024
Advancing Post-OCR Correction: A Comparative Study of Synthetic Data
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Shuhao Guan
Derek Greene
292
9
0
05 Aug 2024
Improving Multilingual Neural Machine Translation by Utilizing Semantic and Linguistic Features
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Mengyu Bu
Shuhao Gu
Yang Feng
333
8
0
02 Aug 2024
Leveraging Entailment Judgements in Cross-Lingual Summarisation
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Huajian Zhang
Laura Perez-Beltrachini
HILM
201
2
0
01 Aug 2024
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team
Morgane Riviere
Shreya Pathak
Pier Giuseppe Sessa
Cassidy Hardin
...
Noah Fiedel
Armand Joulin
Kathleen Kenealy
Robert Dadashi
Alek Andreev
VLM
MoE
OSLM
620
1,566
0
31 Jul 2024
Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning
Omer Nacar
Anis Koubaa
156
6
0
30 Jul 2024
Sentiment Analysis of Lithuanian Online Reviews Using Large Language Models
International Conference on Information Technology (ICIT), 2024
Brigita Vileikytė
M. Lukoševičius
Lukas Stankevicius
176
1
0
29 Jul 2024
Granularity is crucial when applying differential privacy to text: An investigation for neural machine translation
Doan Nam Long Vu
Timour Igamberdiev
Ivan Habernal
173
1
0
26 Jul 2024
Unified Lexical Representation for Interpretable Visual-Language Alignment
Yifan Li
Yikai Wang
Yanwei Fu
Dongyu Ru
Zheng Zhang
Tong He
VLM
207
7
0
25 Jul 2024
Coupling Speech Encoders with Downstream Text Models
Ciprian Chelba
J. Schalkwyk
AuLLM
177
0
0
24 Jul 2024
Dependency Transformer Grammars: Integrating Dependency Structures into Transformer Language Models
Yida Zhao
Chao Lou
Kewei Tu
212
2
0
24 Jul 2024
The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization
Samuele Cornell
Taejin Park
Steve Huang
Christoph Boeddeker
Xuankai Chang
Matthew Maciejewski
Sanjeev Khudanpur
Paola García
Shinji Watanabe
223
24
0
23 Jul 2024
Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines
Yuchen Li
Alexandre Kirchmeyer
Aashay Mehta
Yilong Qin
Boris Dadachev
Kishore Papineni
Sanjiv Kumar
Andrej Risteski
306
4
0
22 Jul 2024
CoVoSwitch: Machine Translation of Synthetic Code-Switched Text Based on Intonation Units
Yeeun Kang
201
2
0
19 Jul 2024
Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training
Lukuan Dong
Donghong Qin
Fengbo Bai
Fanhua Song
Yan Liu
Chen Xu
Zhijian Ou
175
1
0
18 Jul 2024
A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR
Jian You
Xiangfeng Li
147
1
0
18 Jul 2024
F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
Jie Yang
Xuesong Niu
Nan Jiang
Ruimao Zhang
Siyuan Huang
228
22
0
17 Jul 2024
Lacuna Language Learning: Leveraging RNNs for Ranked Text Completion in Digitized Coptic Manuscripts
Lauren Levine
Cindy Tung Li
Lydia Bremer-McCollum
Nicholas Wagner
Amir Zeldes
RALM
144
2
0
17 Jul 2024
DANIEL: A fast Document Attention Network for Information Extraction and Labelling of handwritten documents
Thomas Constum
Pierrick Tranouez
Thierry Paquet
180
8
0
12 Jul 2024
Towards Chapter-to-Chapter Context-Aware Literary Translation via Large Language Models
Linghao Jin
Li An
Xuezhe Ma
297
0
0
12 Jul 2024
Mitigating Catastrophic Forgetting in Language Transfer via Model Merging
Anton Alexandrov
Veselin Raychev
Mark Niklas Muller
Ce Zhang
Martin Vechev
Kristina Toutanova
MoMe
CLL
KELM
396
37
0
11 Jul 2024
Automata-based constraints for language model decoding
Terry Koo
Frederick Liu
Luheng He
AI4CE
363
40
0
11 Jul 2024
Learning Program Behavioral Models from Synthesized Input-Output Pairs
Tural Mammadov
Dietrich Klakow
Alexander Koller
Andreas Zeller
275
3
0
11 Jul 2024
Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models
Jupinder Parmar
Sanjev Satheesh
M. Patwary
Mohammad Shoeybi
Bryan Catanzaro
276
52
0
09 Jul 2024
Data, Data Everywhere: A Guide for Pretraining Dataset Construction
Jupinder Parmar
Shrimai Prabhumoye
Pritam Gundecha
Bo Liu
Aastha Jhunjhunwala
Zhilin Wang
M. Patwary
Mohammad Shoeybi
Bryan Catanzaro
294
11
0
08 Jul 2024
MST5 -- Multilingual Question Answering over Knowledge Graphs
Nikit Srivastava
Mengshi Ma
Daniel Vollmers
Hamada M. Zahera
Diego Moussallem
A. N. Ngomo
154
5
0
08 Jul 2024
An Empirical Comparison of Vocabulary Expansion and Initialization Approaches for Language Models
Nandini Mundra
Aditya Nanda Kishore
Mary Dabre
Ratish Puduppully
Anoop Kunchukuttan
Mitesh Khapra
194
17
0
08 Jul 2024
Large Language Models Understand Layout
Weiming Li
Manni Duan
Dong An
Yan Shao
322
6
0
08 Jul 2024
Do Multilingual Large Language Models Mitigate Stereotype Bias?
Shangrui Nie
Michael Fromm
Charles F Welch
Rebekka Görge
Akbar Karimi
Joan Plepi
Nazia Afsan Mowmita
Nicolas Flores-Herr
Mehdi Ali
Lucie Flek
334
13
0
08 Jul 2024
Statistical investigations into the geometry and homology of random programs
Jon Sporring
Ken Friis Larsen
99
0
0
05 Jul 2024
Toucan: Many-to-Many Translation for 150 African Language Pairs
AbdelRahim Elmadany
Ife Adebara
Muhammad Abdul-Mageed
256
5
0
05 Jul 2024
Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models
Bolaji Yusuf
M. Baskar
Andrew Rosenberg
Bhuvana Ramabhadran
169
2
0
05 Jul 2024
TokenVerse: Unifying Speech and NLP Tasks via Transducer-based ASR
Shashi Kumar
S. Madikeri
Juan Zuluaga-Gomez
Iuliia Nigmatulina
Esaú Villatoro-Tello
Sergio Burdisso
P. Motlícek
Karthik Pandia
A. Ganapathiraju
220
0
0
05 Jul 2024
Serialized Output Training by Learned Dominance
Ying Shi
Lantian Li
Shi Yin
D. Wang
Jiqing Han
141
7
0
04 Jul 2024
BM25S: Orders of magnitude faster lexical search via eager sparse scoring
Xing Han Lù
209
82
0
04 Jul 2024
Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations
Kunal Dhawan
Nithin Rao Koluguri
Ante Jukić
Ryan Langman
Jagadeesh Balam
Boris Ginsburg
222
13
0
03 Jul 2024
Single Character Perturbations Break LLM Alignment
Leon Lin
Hannah Brown
Kenji Kawaguchi
Michael Shieh
AAML
832
8
0
03 Jul 2024
Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data
Minato Kondo
T. Utsuro
Masaaki Nagata
CLL
212
8
0
03 Jul 2024
A Case Study on Context-Aware Neural Machine Translation with Multi-Task Learning
Ramakrishna Appicharla
Baban Gain
Santanu Pal
Asif Ekbal
Pushpak Bhattacharyya
204
3
0
03 Jul 2024
Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale
Wenzhen Zheng
Wenbo Pan
Xu Xu
Libo Qin
Li Yue
Ming Zhou
CLL
224
12
0
02 Jul 2024
How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation
Yan Meng
Di Wu
Christof Monz
345
4
0
02 Jul 2024