1808.06226
Cited By
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
19 August 2018
Taku Kudo
John Richardson
Github (10925★)
Papers citing
"SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"
50 / 2,063 papers shown
An Early Investigation into the Utility of Multimodal Large Language Models in Medical Imaging
Sulaiman Khan
Md. Rafiul Biswas
Alina Murad
Hazrat Ali
Zubair Shah
174
6
0
02 Jun 2024
μLO: Compute-Efficient Meta-Generalization of Learned Optimizers
Benjamin Thérien
Charles-Étienne Joseph
Boris Knyazev
Edouard Oyallon
Irina Rish
Eugene Belilovsky
AI4CE
498
5
0
31 May 2024
How Multilingual Are Large Language Models Fine-Tuned for Translation?
Aquia Richburg
Marine Carpuat
LRM
175
7
0
30 May 2024
Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning
E. Chimoto
Jay Gala
Orevaoghene Ahia
Julia Kreutzer
Bruce A. Bassett
Sara Hooker
VLM
362
6
0
29 May 2024
X-VILA: Cross-Modality Alignment for Large Language Model
Hanrong Ye
De-An Huang
Yao Lu
Zhiding Yu
Ming-Yu Liu
...
Jan Kautz
Song Han
Dan Xu
Pavlo Molchanov
Hongxu Yin
MLLM
VLM
268
44
0
29 May 2024
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series
Ge Zhang
Scott Qu
Jiaheng Liu
Chenchen Zhang
Chenghua Lin
...
Zi-Kai Zhao
Jiajun Zhang
Wanli Ouyang
Wenhao Huang
Lei Ma
ELM
311
72
0
29 May 2024
Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation
Langlin Huang
Yang Feng
247
2
0
29 May 2024
Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform
Viviane Potocnik
Luca Colagrande
Tim Fischer
L. Bertaccini
Daniele Jahier Pagliari
Luca Bompani
Luca Benini
297
4
0
29 May 2024
Enhancing Descriptive Image Quality Assessment with A Large-scale Multi-modal Dataset
IEEE Transactions on Image Processing (TIP), 2024
Zhiyuan You
Jinjin Gu
Zheyuan Li
Xin Cai
Kaiwen Zhu
Chao Dong
Tianfan Xue
EGVM
460
38
0
29 May 2024
Wavelet-Based Image Tokenizer for Vision Transformers
Zhenhai Zhu
Radu Soricut
ViT
234
6
0
28 May 2024
Multi-objective Representation for Numbers in Clinical Narratives: A CamemBERT-Bio-Based Alternative to Large-Scale LLMs
Boammani Aser Lompo
Thanh-Dung Le
375
1
0
28 May 2024
Empowering Character-level Text Infilling by Eliminating Sub-Tokens
Houxing Ren
Mingjie Zhan
Zhongyuan Wu
Jiaming Song
AI4CE
172
2
0
27 May 2024
Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization
Dixuan Wang
Yanda Li
Junyuan Jiang
Zepeng Ding
Ziqin Luo
Guochao Jiang
Jiaqing Liang
Deqing Yang
482
33
0
27 May 2024
MoEUT: Mixture-of-Experts Universal Transformers
Róbert Csordás
Kazuki Irie
Jürgen Schmidhuber
Christopher Potts
Christopher D. Manning
MoE
256
28
0
25 May 2024
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
Byung-Kwan Lee
Chae Won Kim
Beomchan Park
Yonghyun Ro
MLLM
LRM
339
28
0
24 May 2024
Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training
Xianzhi Du
Tom Gunter
Xiang Kong
Mark Lee
Zirui Wang
Aonan Zhang
Nan Du
Ruoming Pang
MoE
134
6
0
23 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
885
166
0
23 May 2024
Why Not Transform Chat Large Language Models to Non-English?
Xiang Geng
Ming Zhu
Jiahuan Li
Zhejian Lai
Wei Zou
...
Xinglin Lyu
M. Zhang
Jiajun Chen
Hao Yang
Shujian Huang
338
7
0
22 May 2024
Non-autoregressive real-time Accent Conversion model with voice cloning
Vladimir Nechaev
Sergey Kosyakov
233
3
0
21 May 2024
Targeted Multilingual Adaptation for Low-resource Language Families
C.M. Downey
Terra Blevins
Dhwani Serai
Dwija Parikh
Shane Steinert-Threlkeld
219
6
0
20 May 2024
FAME-MT Dataset: Formality Awareness Made Easy for Machine Translation Purposes
Dawid Wiśniewski
Zofia Rostek
Artur Nowakowski
244
0
0
20 May 2024
Chasing COMET: Leveraging Minimum Bayes Risk Decoding for Self-Improving Machine Translation
Kamil Guttmann
Miko Pokrywka
Adrian Charkiewicz
Artur Nowakowski
238
9
0
20 May 2024
Automated Radiology Report Generation: A Review of Recent Advances
IEEE Reviews in Biomedical Engineering (RBME), 2024
Phillip Sloan
Philip Clatworthy
Edwin Simpson
Majid Mirmehdi
249
63
0
17 May 2024
Libra: Building Decoupled Vision System on Large Language Models
International Conference on Machine Learning (ICML), 2024
Yifan Xu
Xiaoshan Yang
Y. Song
Changsheng Xu
MLLM
VLM
195
10
0
16 May 2024
TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data
International Conference on Computational Linguistics (COLING), 2024
Yihong Liu
Chunlan Ma
Haotian Ye
Hinrich Schütze
260
7
0
16 May 2024
Unsupervised Extractive Dialogue Summarization in Hyperdimensional Space
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Seongmin Park
Kyungho Kim
Jaejin Seo
Jihwa Lee
222
0
0
16 May 2024
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team
MLLM
583
629
0
16 May 2024
When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models
Xianzheng Ma
Brandon Smart
Shuai Chen
Xinghui Li
...
Matthias Nießner
Ian D Reid
Angel X. Chang
Iro Laina
V. Prisacariu
LRM
367
30
0
16 May 2024
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
Wanting Xu
Yang Liu
Langping He
Xucheng Huang
Ling Jiang
VLM
MLLM
211
5
0
15 May 2024
A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining
Masaaki Nagata
Makoto Morishita
Katsuki Chousa
Norihito Yasuda
160
3
0
15 May 2024
Challenges and Opportunities in Text Generation Explainability
Kenza Amara
Rita Sevastjanova
Mennatallah El-Assady
SILM
207
3
0
14 May 2024
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling
International Conference on Machine Learning (ICML), 2024
Siyuan Li
Zedong Wang
Zicheng Liu
Di Wu
Cheng Tan
Jiangbin Zheng
Yufei Huang
Stan Z. Li
214
14
0
13 May 2024
An Empirical Study on the Robustness of Massively Multilingual Neural Machine Translation
International Conference on Language Resources and Evaluation (LREC), 2024
Supryadi Supryadi
Leiyu Pan
Deyi Xiong
171
0
0
13 May 2024
Zero-Shot Tokenizer Transfer
Neural Information Processing Systems (NeurIPS), 2024
Benjamin Minixhofer
Edoardo Ponti
Ivan Vulić
VLM
278
25
0
13 May 2024
DEPTH: Discourse Education through Pre-Training Hierarchically
Zachary Bamberger
Ofek Glick
Chaim Baskin
Yonatan Belinkov
318
0
0
13 May 2024
MedVersa: A Generalist Foundation Model for Medical Image Interpretation
Hong-Yu Zhou
Subathra Adithan
J. N. Acosta
Suvrankar Datta
E. Topol
Pranav Rajpurkar
MedIm
434
29
0
13 May 2024
Constructing a BPE Tokenization DFA
Martin Berglund
Willeke Martens
Brink van der Merwe
161
3
0
13 May 2024
SaudiBERT: A Large Language Model Pretrained on Saudi Dialect Corpora
Faisal Qarah
206
12
0
10 May 2024
Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Nathaniel R. Robinson
Mary Dabre
Ammon Shurtz
Rasul Dent
Onenamiyi Onesi
...
Matthew Dean Stutzman
Bismarck Odoom
Sanjeev Khudanpur
Stephen D. Richardson
Kenton Murray
MoE
254
18
0
08 May 2024
Revisiting character-level adversarial attacks
Elias Abad Rocamora
Yongtao Wu
Fanghui Liu
Grigorios G. Chrysos
Volkan Cevher
AAML
244
6
0
07 May 2024
Position: Leverage Foundational Models for Black-Box Optimization
International Conference on Machine Learning (ICML), 2024
Xingyou Song
Yingtao Tian
Robert Tjarko Lange
Chansoo Lee
Yujin Tang
Yutian Chen
417
16
0
06 May 2024
Revisiting N-Gram Models: Their Impact in Modern Neural Networks for Handwritten Text Recognition
Solène Tarride
Christopher Kermorvant
169
1
0
30 Apr 2024
Unknown Script: Impact of Script on Cross-Lingual Transfer
Wondimagegnhue Tufa
Ilia Markov
Piek Vossen
382
2
0
29 Apr 2024
Decoding Radiologists' Intentions: A Novel System for Accurate Region Identification in Chest X-ray Image Analysis
Akash Awasthi
Safwan Ahmad
Bryant Le
Hien Nguyen
127
2
0
29 Apr 2024
A cost minimization approach to fix the vocabulary size in a tokenizer for an End-to-End ASR system
Sunil Kumar Kopparapu
Ashish Panda
132
0
0
29 Apr 2024
PatentGPT: A Large Language Model for Intellectual Property
Zilong Bai
Ruiji Zhang
Linqing Chen
Qijun Cai
Yuan Zhong
...
Fu Bian
Xiaolong Gu
Lisha Zhang
Wentao Wu
Changyang Tu
444
8
0
28 Apr 2024
Can Perplexity Predict Fine-tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali
Nishant Luitel
Nirajan Bekoju
Anand Kumar Sah
Subarna Shakya
285
2
0
28 Apr 2024
Scaffold-BPE: Enhancing Byte Pair Encoding with Simple and Effective Scaffold Token Removal
Haoran Lian
Yizhe Xiong
Jianwei Niu
Shasha Mo
Zhenpeng Su
Zijia Lin
Peng Liu
Hui Chen
Guiguang Ding
225
2
0
27 Apr 2024
Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities
Kazuki Fujii
Taishi Nakamura
Mengsay Loem
Hiroki Iida
Masanari Ohi
Kakeru Hattori
Hirai Shota
Sakae Mizuki
Rio Yokota
Naoaki Okazaki
CLL
323
112
0
27 Apr 2024
Prefix Text as a Yarn: Eliciting Non-English Alignment in Foundation Language Model
Runzhe Zhan
Xinyi Yang
Yang Li
Lidia S. Chao
Yue Zhang
364
12
0
25 Apr 2024