2204.08832
Cited By
Impact of Tokenization on Language Models: An Analysis for Turkish
19 April 2022
Cagri Toraman
E. Yilmaz
Furkan Şahinuç
Oguzhan Ozcelik
Papers citing "Impact of Tokenization on Language Models: An Analysis for Turkish"
32 / 32 papers shown
Title
GFT: Gradient Focal Transformer
Boris Kriuk
Simranjit Kaur Gill
Shoaib Aslam
Amir Fakhrutdinov
31
0
0
14 Apr 2025
Overcoming Vocabulary Constraints with Pixel-level Fallback
Jonas F. Lotz
Hendra Setiawan
Stephan Peitz
Yova Kementchedjhieva
38
0
0
02 Apr 2025
From Smør-re-brød to Subwords: Training LLMs on Danish, One Morpheme at a Time
Mikkel Wildner Kildeberg
Emil Allerslev Schledermann
Nicolaj Larsen
Rob van der Goot
31
0
0
02 Apr 2025
Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set
Florian Eichin
Y. Liu
Barbara Plank
Michael A. Hedderich
39
0
0
13 Mar 2025
UniNet: A Unified Multi-granular Traffic Modeling Framework for Network Security
Binghui Wu
D. Divakaran
M. Gurusamy
57
0
0
06 Mar 2025
Efficient Continual Pre-training of LLMs for Low-resource Languages
Arijit Nag
Soumen Chakrabarti
Animesh Mukherjee
Niloy Ganguly
77
0
0
13 Dec 2024
Morphological Typology in BPE Subword Productivity and Language Modeling
Iñigo Parra
31
0
0
31 Oct 2024
Evaluating Morphological Compositional Generalization in Large Language Models
Mete Ismayilzada
Defne Çirci
Jonne Sälevä
Hale Sirin
Abdullatif Köksal
Bhuwan Dhingra
Antoine Bosselut
Lonneke van der Plas
Duygu Ataman
26
2
0
16 Oct 2024
The Fair Language Model Paradox
Andrea Pinto
Tomer Galanti
Randall Balestriero
23
0
0
15 Oct 2024
Performance Evaluation of Tokenizers in Large Language Models for the Assamese Language
Sagar Tamang
Dibya Jyoti Bora
33
3
0
28 Sep 2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
Pavel Chizhov
Catherine Arnett
Elizaveta Korotkova
Ivan P. Yamshchikov
40
2
0
06 Sep 2024
Towards General Industrial Intelligence: A Survey on IIoT-Enhanced Continual Large Models
Jiao Chen
Jiayi He
Fangfang Chen
Zuohong Lv
Jianhua Tang
Weihua Li
Zuozhu Liu
Howard H. Yang
Guangjie Han
AI4CE
34
1
0
02 Sep 2024
LlamaTurk: Adapting Open-Source Generative Large Language Models for Low-Resource Language
Cagri Toraman
VLM
30
5
0
13 May 2024
SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound
Haohe Liu
Xuenan Xu
Yi Yuan
Mengyue Wu
Wenwu Wang
Mark D. Plumbley
27
18
0
30 Apr 2024
Can Perplexity Predict Fine-Tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali
Nishant Luitel
Nirajan Bekoju
Anand Kumar Sah
Subarna Shakya
42
0
0
28 Apr 2024
Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement
Catherine Arnett
Pamela D. Rivière
Tyler A. Chang
Sean Trott
24
2
0
20 Mar 2024
Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models
M. Alrefaie
Nour Eldin Morsy
Nada Samir
23
6
0
17 Mar 2024
On the Challenges and Opportunities in Generative AI
Laura Manduchi
Kushagra Pandey
Robert Bamler
Ryan Cotterell
Sina Daubener
...
F. Wenzel
Frank Wood
Stephan Mandt
Vincent Fortuin
56
17
0
28 Feb 2024
How Important Is Tokenization in French Medical Masked Language Models?
Yanis Labrak
Adrien Bazoge
B. Daille
Mickael Rouvier
Richard Dufour
28
1
0
22 Feb 2024
An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference
Atsuki Yamaguchi
Aline Villavicencio
Nikolaos Aletras
19
7
0
16 Feb 2024
A Language Model for Particle Tracking
Andris Huang
Yash Melkani
P. Calafiura
Alina Lazar
D. Murnane
Minh-Tuan Pham
Xiangyang Ju
28
7
0
14 Feb 2024
Stolen Subwords: Importance of Vocabularies for Machine Translation Model Stealing
Vilém Zouhar
AAML
35
0
0
29 Jan 2024
A Predictive Factor Analysis of Social Biases and Task-Performance in Pretrained Masked Language Models
Yi Zhou
Jose Camacho-Collados
Danushka Bollegala
81
6
0
19 Oct 2023
Core Building Blocks: Next Gen Geo Spatial GPT Application
Ashley Fernandez
Swaraj Dube
16
4
0
17 Oct 2023
Tokenizer Choice For LLM Training: Negligible or Crucial?
Mehdi Ali
Michael Fromm
Klaudia Thellmann
Richard Rutmann
Max Lübbering
...
Malte Ostendorff
Samuel Weinbach
R. Sifa
Stefan Kesselheim
Nicolas Flores-Herr
21
47
0
12 Oct 2023
MorphPiece: A Linguistic Tokenizer for Large Language Models
Jeffrey Hsu
13
3
0
14 Jul 2023
How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese
T. Fujii
Koki Shibata
Atsuki Yamaguchi
Terufumi Morishita
Yasuhiro Sogawa
16
13
0
16 Jun 2023
Effects of sub-word segmentation on performance of transformer language models
Jue Hou
Anisia Katinskaia
Anh Vu
R. Yangarber
13
4
0
09 May 2023
Harnessing the Power of BERT in the Turkish Clinical Domain: Pretraining Approaches for Limited Data Scenarios
Hazal Türkmen
Oğuz Dikenelli
C. Eraslan
Mehmet Cem Çalli
S. Özbek
16
2
0
05 May 2023
Understanding BLOOM: An empirical study on diverse NLP tasks
Parag Dakle
Sai Krishna Rallabandi
Preethi Raghavan
AI4CE
31
3
0
27 Nov 2022
Pretrained Transformers for Text Ranking: BERT and Beyond
Jimmy J. Lin
Rodrigo Nogueira
Andrew Yates
VLM
219
608
0
13 Oct 2020
Efficient Estimation of Word Representations in Vector Space
Tomáš Mikolov
Kai Chen
G. Corrado
J. Dean
3DV
228
31,244
0
16 Jan 2013