1808.06226
Cited By
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
19 August 2018
Taku Kudo
John Richardson
Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"
50 / 1,923 papers shown
Title
FreeMesh: Boosting Mesh Generation with Coordinates Merging
Jian Liu
Haohan Weng
Biwen Lei
Xianghui Yang
Zibo Zhao
Zhuo Chen
Song Guo
Tao Han
Chunchao Guo
20
0
0
19 May 2025
Neural Morphological Tagging for Nguni Languages
Cael Marquard
Simbarashe Mawere
Francois Meyer
7
0
0
19 May 2025
TiSpell: A Semi-Masked Methodology for Tibetan Spelling Correction covering Multi-Level Error with Data Augmentation
Yutong Liu
Feng Xiao
Ziyue Zhang
Yongbin Yu
Cheng Huang
...
Thupten Tsering
Cheng Huang
Gadeng Luosang
Renzeng Duojie
Nyima Tashi
31
0
0
12 May 2025
GIF: Generative Inspiration for Face Recognition at Scale
Saeed Ebrahimi
Sahar Rahimi
Ali Dabouei
Srinjoy Das
Jeremy M. Dawson
Nasser M. Nasrabadi
CVBM
216
0
0
05 May 2025
Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
Piotr Piekos
Róbert Csordás
Jürgen Schmidhuber
MoE
VLM
106
1
0
01 May 2025
Fast and Low-Cost Genomic Foundation Models via Outlier Removal
Haozheng Luo
Chenghao Qiu
Maojiang Su
Zhihan Zhou
Zoe Mehta
Guo Ye
Jerry Yao-Chieh Hu
Han Liu
AAML
55
1
0
01 May 2025
Improving Informally Romanized Language Identification
Adrian Benton
Alexander Gutkin
Christo Kirov
Brian Roark
55
0
0
30 Apr 2025
Modes of Sequence Models and Learning Coefficients
Zhongtian Chen
Daniel Murfet
90
1
0
25 Apr 2025
Tokenization Matters: Improving Zero-Shot NER for Indic Languages
Priyaranjan Pattnayak
Hitesh Laxmichand Patel
Amit Agarwal
37
0
0
23 Apr 2025
Compass-V2 Technical Report
Sophia Maria
MoE
LRM
41
0
0
22 Apr 2025
HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization
Enes Özeren
Yihong Liu
Hinrich Schütze
36
0
0
21 Apr 2025
Kuwain 1.5B: An Arabic SLM via Language Injection
Khalil Hennara
Sara Chrouf
Mohamed Motaism Hamed
Zeina Aldallal
Omar Hadid
Safwan AlModhayan
37
1
0
21 Apr 2025
Sparks of Science: Hypothesis Generation Using Structured Paper Data
Charles O'Neill
Tirthankar Ghosal
Roberta Răileanu
Mike Walmsley
Thang Bui
Kevin Schawinski
I. Ciucă
LRM
56
0
0
17 Apr 2025
EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery
Wei Zhang
Miaoxin Cai
Yaqian Ning
Tao Zhang
Yin Zhuang
He Chen
Jun Li
Xuerui Mao
38
0
0
17 Apr 2025
MorphTok: Morphologically Grounded Tokenization for Indian Languages
Maharaj Brahma
NJ Karthika
A. Singh
D. Adiga
Smruti Bhate
Ganesh Ramakrishnan
Rohit Saluja
Maunendra Sankar Desarkar
34
0
0
14 Apr 2025
RNN-Transducer-based Losses for Speech Recognition on Noisy Targets
Vladimir Bataev
35
0
0
09 Apr 2025
High-Resource Translation: Turning Abundance into Accessibility
Abhiram Reddy Yanampally
24
0
0
08 Apr 2025
Learnable Multi-Scale Wavelet Transformer: A Novel Alternative to Self-Attention
Andrew Kiruluta
Priscilla Burity
Samantha Williams
33
3
0
08 Apr 2025
GOLLuM: Gaussian Process Optimized LLMs -- Reframing LLM Finetuning through Bayesian Optimization
Bojana Ranković
P. Schwaller
BDL
241
0
0
08 Apr 2025
JarvisIR: Elevating Autonomous Driving Perception with Intelligent Image Restoration
Yunlong Lin
Zixu Lin
Haoyu Chen
Panwang Pan
C. Li
Sixiang Chen
Yeying Jin
W. J. Li
Xinghao Ding
30
1
0
05 Apr 2025
Efficient Federated Learning Tiny Language Models for Mobile Network Feature Prediction
Daniel Becking
Ingo Friese
Karsten Müller
Thomas Buchholz
Mandy Galkow-Schneider
Wojciech Samek
D. Marpe
36
0
0
02 Apr 2025
Enhancing Embedding Representation Stability in Recommendation Systems with Semantic ID
Carolina Zheng
Minhui Huang
Dmitrii Pedchenko
Kaushik Rangadurai
S. Wang
...
Yiping Han
Lin Yang
Hangjun Xu
Rong Jin
Shuang Yang
38
0
0
02 Apr 2025
SocialGen: Modeling Multi-Human Social Interaction with Language Models
Heng Yu
Juze Zhang
Changan Chen
Tiange Xiang
Yusu Fang
Juan Carlos Niebles
Ehsan Adeli
VGen
54
0
0
28 Mar 2025
Tokenization of Gaze Data
Tim Rolff
Jurik Karimian
Niklas Hypki
S. Schmidt
Markus Lappe
Frank Steinicke
41
0
0
28 Mar 2025
Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models
Alex Jinpeng Wang
Linjie Li
Zheng Yang
Lijuan Wang
Min Li
DiffM
73
0
0
26 Mar 2025
Named Entity Recognition in Context
Colin Brisson
Ayoub Kahfy
Marc Bui
Frédéric Constant
56
0
0
26 Mar 2025
Gemma 3 Technical Report
Gemma Team
Aishwarya B Kamath
Johan Ferret
Shreya Pathak
Nino Vieillard
...
Harshal Tushar Lehri
Hussein Hazimeh
Ian Ballantyne
Idan Szpektor
Ivan Nardini
VLM
93
47
0
25 Mar 2025
Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching
Benjamin Minixhofer
Ivan Vulić
Edoardo Ponti
226
0
0
25 Mar 2025
Payload-Aware Intrusion Detection with CMAE and Large Language Models
Yongcheol Kim
Chanjae Lee
Young Yoon
49
0
0
23 Mar 2025
KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications
M. Bommarito
Daniel Martin Katz
Jillian Bommarito
42
1
0
21 Mar 2025
Self-Vocabularizing Training for Neural Machine Translation
Pin-Jie Lin
Ernie Chang
Yangyang Shi
Vikas Chandra
71
0
0
18 Mar 2025
SuperBPE: Space Travel for Language Models
Alisa Liu
J. Hayase
Valentin Hofmann
Sewoong Oh
Noah A. Smith
Yejin Choi
51
3
0
17 Mar 2025
Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
Lijie Fan
Luming Tang
Siyang Qin
Tianhong Li
Xuan S. Yang
...
Tao Zhu
Michael Rubinstein
Michalis Raptis
Deqing Sun
Radu Soricut
60
5
0
17 Mar 2025
Plausibility Vaccine: Injecting LLM Knowledge for Event Plausibility
Jacob Chmura
Jonah Dauvet
Sebastian Sabry
59
0
0
16 Mar 2025
Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models
Julian Spravil
Sebastian Houben
Sven Behnke
VLM
78
0
0
12 Mar 2025
BPQA Dataset: Evaluating How Well Language Models Leverage Blood Pressures to Answer Biomedical Questions
Chi Hang
Ruiqi Deng
L. Jiang
Zihao Yang
Anton Alyakin
Daniel Alber
E. Oermann
AI4MH
LM&MA
47
0
0
06 Mar 2025
On the Acquisition of Shared Grammatical Representations in Bilingual Language Models
Catherine Arnett
Tyler A. Chang
J. Michaelov
Benjamin Bergen
43
0
0
05 Mar 2025
LLM-Safety Evaluations Lack Robustness
Tim Beyer
Sophie Xhonneux
Simon Geisler
Gauthier Gidel
Leo Schwinn
Stephan Günnemann
ALM
ELM
251
0
0
04 Mar 2025
SkipPipe: Partial and Reordered Pipelining Framework for Training LLMs in Heterogeneous Networks
Nikolay Blagoev
Lydia Yiyu Chen
Oğuzhan Ersoy
57
0
0
27 Feb 2025
(Mis)Fitting: A Survey of Scaling Laws
Margaret Li
Sneha Kudugunta
Luke Zettlemoyer
71
3
0
26 Feb 2025
A City of Millions: Mapping Literary Social Networks At Scale
Sil Hamilton
Rebecca M. M. Hicke
David M. Mimno
Matthew Wilkens
GNN
240
1
0
26 Feb 2025
Lost in Space: Optimizing Tokens for Grammar-Constrained Decoding
Sil Hamilton
David Mimno
76
0
0
24 Feb 2025
TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking
Shahriar Kabir Nahin
R. N. Nandi
Sagor Sarker
Quazi Sarwar Muhtaseem
Md. Kowsher
Apu Chandraw Shill
Md Ibrahim
Mehadi Hasan Menon
Tareq Al Muntasir
Firoj Alam
68
0
0
24 Feb 2025
Scaling Laws for Downstream Task Performance in Machine Translation
Berivan Isik
Natalia Ponomareva
Hussein Hazimeh
Dimitris Paparas
Sergei Vassilvitskii
Sanmi Koyejo
113
4
0
24 Feb 2025
Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models
Lior Belenki
Alekh Agarwal
Tianze Shi
Kristina Toutanova
MoE
60
0
0
21 Feb 2025
Deterministic Reversible Data Augmentation for Neural Machine Translation
Jiashu Yao
Heyan Huang
Zeming Liu
Yuhang Guo
51
0
0
21 Feb 2025
PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference
Burc Gokden
52
0
0
19 Feb 2025
Baichuan-M1: Pushing the Medical Capability of Large Language Models
Binghai Wang
Haizhou Zhao
Huozhi Zhou
Liang Song
Mingyu Xu
...
Yan Zhang
Yifei Duan
Yuyan Zhou
Zhi-Ming Ma
Zhikai Wu
LM&MA
ELM
AI4MH
42
4
0
18 Feb 2025
From Principles to Applications: A Comprehensive Survey of Discrete Tokenizers in Generation, Comprehension, Recommendation, and Information Retrieval
Jian Jia
Jingtong Gao
Ben Xue
Junhao Wang
Qingpeng Cai
Quan Chen
Xiangyu Zhao
Peng Jiang
Kun Gai
OffRL
77
0
0
18 Feb 2025
Enhancing LLM Character-Level Manipulation via Divide and Conquer
Zhen Xiong
Yujun Cai
Bryan Hooi
Nanyun Peng
Kai-Wei Chang
Zhecheng Li
70
0
0
12 Feb 2025