1808.06226
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
19 August 2018
Taku Kudo
John Richardson
Papers citing "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"
50 / 1,923 papers shown
Scaling Embedding Layers in Language Models
Da Yu
Edith Cohen
Badih Ghazi
Yangsibo Huang
Pritish Kamath
Ravi Kumar
Daogao Liu
Chiyuan Zhang
87
0
0
03 Feb 2025
A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport
Yacouba Kaloga
Shashi Kumar
P. Motlícek
Ina Kodrasi
OT
74
0
0
03 Feb 2025
Vision-centric Token Compression in Large Language Model
Ling Xing
Alex Jinpeng Wang
Rui Yan
Xiangbo Shu
Jinhui Tang
VLM
65
0
0
02 Feb 2025
Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Nadav Timor
Jonathan Mamou
Daniel Korat
Moshe Berchansky
Oren Pereg
Gaurav Jain
Roy Schwartz
Moshe Wasserblat
David Harel
98
2
0
31 Jan 2025
BLR-MoE: Boosted Language-Routing Mixture of Experts for Domain-Robust Multilingual E2E ASR
Guodong Ma
Wenxuan Wang
Lifeng Zhou
Yuting Yang
Yuke Li
Binbin Du
MoE
79
0
0
22 Jan 2025
aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing
Siyuan Jiang
Jia Li
He Zong
Huanyu Liu
Hao Zhu
...
Wei Ning
G. Wang
Yihong Dong
Kechi Zhang
Ge Li
ALM
72
1
0
17 Jan 2025
BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
Sheng Zhang
Yanbo Xu
Naoto Usuyama
Hanwen Xu
J. Bagga
...
Carlo Bifulco
M. Lungren
Tristan Naumann
Sheng Wang
Hoifung Poon
LM&MA
MedIm
154
208
0
10 Jan 2025
Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation
Zhi Qu
Yiran Wang
Jiannan Mao
Chenchen Ding
Hideki Tanaka
Masao Utiyama
Taro Watanabe
LRM
42
0
0
06 Jan 2025
On LLM-Enhanced Mixed-Type Data Imputation with High-Order Message Passing
Jinqiao Wang
Kai Wang
Wenjie Qu
Wenjie Zhang
Xiwei Xu
Xuemin Lin
41
1
0
04 Jan 2025
Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison
Tsz Kin Lam
Marco Gaido
Sara Papi
L. Bentivogli
Barry Haddow
36
0
0
04 Jan 2025
A Modular-based Strategy for Mitigating Gradient Conflicts in Simultaneous Speech Translation
Xiaoqian Liu
Yangfan Du
Rongxiang Weng
Yuan Ge
Chen Xu
Tong Xiao
Guocheng Chen
Jingbo Zhu
41
0
0
31 Dec 2024
ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition
Seungdong Yoa
Seungjun Lee
Hyeseung Cho
Bumsoo Kim
Woohyung Lim
ViT
75
0
0
21 Dec 2024
ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling
William Jongwon Han
Chaojing Duan
M. Rosenberg
Emerson Liu
Ding Zhao
79
0
0
18 Dec 2024
Extending LLMs to New Languages: A Case Study of Llama and Persian Adaptation
Samin Mahdizadeh Sani
Pouya Sadeghi
Thuy-Trang Vu
Yadollah Yaghoobzadeh
Gholamreza Haffari
78
2
0
17 Dec 2024
The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion
Changan Chen
Juze Zhang
S. K. Lakshmikanth
Yusu Fang
Ruizhi Shao
Gordon Wetzstein
L. Fei-Fei
Ehsan Adeli
VGen
84
3
0
13 Dec 2024
Efficient Continual Pre-training of LLMs for Low-resource Languages
Arijit Nag
Soumen Chakrabarti
Animesh Mukherjee
Niloy Ganguly
82
0
0
13 Dec 2024
Multi-Head Encoding for Extreme Label Classification
Daojun Liang
Haixia Zhang
Dongfeng Yuan
Minggao Zhang
75
0
0
13 Dec 2024
PolyIPA -- Multilingual Phoneme-to-Grapheme Conversion Model
Davor Lauc
77
0
0
12 Dec 2024
Scaling Sequential Recommendation Models with Transformers
Pablo Zivic
Hernán Ceferino Vázquez
Jorge Sanchez
OffRL
LRM
79
1
0
10 Dec 2024
From Language Models over Tokens to Language Models over Characters
Tim Vieira
Ben LeBrun
Mario Giulianelli
Juan Luis Gastaldi
Brian DuSell
John Terilla
Timothy J. O'Donnell
Ryan Cotterell
81
8
0
04 Dec 2024
Yi-Lightning Technical Report
01.AI
Alan Wake
Albert Wang
Bei Chen
...
Yuxuan Sha
Zhaodong Yan
Zhiyuan Liu
Zirui Zhang
Zonghong Dai
OSLM
102
3
0
02 Dec 2024
A Wave is Worth 100 Words: Investigating Cross-Domain Transferability in Time Series
Xiangkai Ma
Xiaobin Hong
Wenzhong Li
Sanglu Lu
AI4TS
64
0
0
01 Dec 2024
ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain
Ali Shiraee Kasmaee
Mohammad Khodadad
Mohammad Arshi Saloot
Nick Sherck
Stephen Dokas
H. Mahyar
Soheila Samiee
ELM
263
0
0
30 Nov 2024
Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods
Burak Suyunu
Enes Taylan
Arzucan Özgür
67
2
0
26 Nov 2024
Efficient Online Inference of Vision Transformers by Training-Free Tokenization
Leonidas Gee
Wing Yan Li
V. Sharmanska
Novi Quadrianto
ViT
93
0
0
23 Nov 2024
The Master-Slave Encoder Model for Improving Patent Text Summarization: A New Approach to Combining Specifications and Claims
Shu Zhou
Xin Wang
Zhengda Zhou
Haohan Yi
Xuhui Zheng
Hao Wan
84
1
0
21 Nov 2024
WaterPark: A Robustness Assessment of Language Model Watermarking
Jiacheng Liang
Zian Wang
Lauren Hong
Shouling Ji
Ting Wang
AAML
111
0
0
20 Nov 2024
Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation
Tim Elsner
Paula Usinger
Julius Nehring-Wirxel
Gregor Kobsik
Victor Czech
Yanjiang He
I. Lim
Leif Kobbelt
39
1
0
15 Nov 2024
Xmodel-1.5: An 1B-scale Multilingual LLM
Wang Qun
Liu Yang
Lin Qingquan
Jiang Ling
LRM
44
0
0
15 Nov 2024
A Practical Guide to Fine-tuning Language Models with Limited Data
Márton Szép
Daniel Rueckert
Rüdiger von Eisenhart-Rothe
Florian Hinterwimmer
SyDa
ALM
49
2
0
14 Nov 2024
Mamba-based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition
Yoshiki Masuyama
Koichi Miyazaki
Masato Murata
Mamba
43
0
0
11 Nov 2024
When are 1.58 bits enough? A Bottom-up Exploration of BitNet Quantization
Jacob Nielsen
Lukas Galke
Peter Schneider-Kamp
MQ
45
1
0
08 Nov 2024
Fine-Grained Reward Optimization for Machine Translation using Error Severity Mappings
Miguel Moura Ramos
Tomás Almeida
Daniel Vareta
Filipe Azevedo
Sweta Agrawal
Patrick Fernandes
André F. T. Martins
38
1
0
08 Nov 2024
Deploying Multi-task Online Server with Large Language Model
Yincen Qu
Chao Ma
Xiangying Dai
Hui Zhou
Yiting Wu
Hengyue Liu
31
0
0
06 Nov 2024
Classification Done Right for Vision-Language Pre-Training
Zilong Huang
Qinghao Ye
Bingyi Kang
Jiashi Feng
Haoqi Fan
CLIP
VLM
50
2
0
05 Nov 2024
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
A. Haliassos
Rodrigo Mira
Honglie Chen
Zoe Landgraf
Stavros Petridis
Maja Pantic
SSL
37
6
0
04 Nov 2024
MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation
Langlin Huang
Mengyu Bu
Yang Feng
38
0
0
03 Nov 2024
SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation
Dennis Fucci
Marco Gaido
Beatrice Savoldi
Matteo Negri
Mauro Cettolo
L. Bentivogli
57
1
0
03 Nov 2024
Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval
Nikolaos Flemotomos
Roger Hsiao
P. Swietojanski
Takaaki Hori
Dogan Can
Xiaodan Zhuang
51
0
0
01 Nov 2024
MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
Julie Kallini
Shikhar Murty
Christopher D. Manning
Christopher Potts
Róbert Csordás
40
2
0
28 Oct 2024
From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages
Artur Kiulian
Anton Polishko
M. Khandoga
Yevhen Kostiuk
Guillermo Gabrielli
...
Hrishikesh Garud
Wendy Wing Yee Mak
Dmytro Chaplynskyi
Selma Belhadj Amor
Grigol Peradze
37
0
0
24 Oct 2024
A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
A. S. Rawat
Veeranjaneyulu Sadhanala
Afshin Rostamizadeh
Ayan Chakrabarti
Wittawat Jitkrittum
...
Rakesh Shivanna
Sashank J. Reddi
A. Menon
Rohan Anil
Sanjiv Kumar
33
2
0
24 Oct 2024
Bielik 7B v0.1: A Polish Language Model -- Development, Insights, and Evaluation
Krzysztof Ociepa
Łukasz Flis
Krzysztof Wróbel
Adrian Gwoździej
Remigiusz Kinas
27
1
0
24 Oct 2024
Scalable Influence and Fact Tracing for Large Language Model Pretraining
Tyler A. Chang
Dheeraj Rajagopal
Tolga Bolukbasi
Lucas Dixon
Ian Tenney
TDI
37
2
0
22 Oct 2024
PLDR-LLM: Large Language Model from Power Law Decoder Representations
Burc Gokden
31
1
0
22 Oct 2024
Methods of improving LLM training stability
Oleg Rybakov
Mike Chrzanowski
Peter Dykas
Jinze Xue
Ben Lanir
28
1
0
22 Oct 2024
Action abstractions for amortized sampling
Oussama Boussif
Léna Néhale Ezzine
J. Viviano
Michał Koziarski
Moksh Jain
Nikolay Malkin
Emmanuel Bengio
Rim Assouel
Yoshua Bengio
32
0
0
19 Oct 2024
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens
Lijie Fan
Tianhong Li
Siyang Qin
Yuanzhen Li
Chen Sun
Michael Rubinstein
Deqing Sun
Kaiming He
Yonglong Tian
VLM
DiffM
48
42
0
17 Oct 2024
MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations
Liang Xu
Shaoyang Hua
Zili Lin
Yifan Liu
Feipeng Ma
Yichao Yan
Xin Jin
Xiaokang Yang
Wenjun Zeng
VGen
39
3
0
17 Oct 2024
Nominal Class Assignment in Swahili: A Computational Account
Giada Palmieri
Konstantinos Kogkalidis
14
0
0
16 Oct 2024