2502.10361
Enhancing Multilingual LLM Pretraining with Model-Based Data Selection
17 February 2025
Bettina Messmer
Vinko Sabolčec
Martin Jaggi
Papers citing "Enhancing Multilingual LLM Pretraining with Model-Based Data Selection"
Tahakom LLM Guidelines and Recipes: From Pre-training Data to an Arabic LLM
Areej AlOtaibi
Lina Alyahya
Raghad Alshabanah
Shahad Alfawzan
Shuruq Alarefei
...
Waad Alahmed
Omar Talabay
Jalal Alowibdi
Salem Alelyani
Adel Bibi
15 Oct 2025
BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data
Jaap Jumelet
Abdellah Fourtassi
Akari Haga
Bastian Bunzeck
Bhargav Shandilya
...
Yurii Paniv
Ziyin Zhang
Arianna Bisazza
Alex Warstadt
Leshem Choshen
11 Oct 2025
Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World
Inés Altemir Marinas
Anastasiia Kucherenko
Alexander Sternfeld
Andrei Kucharavy
10 Oct 2025
mmBERT: A Modern Multilingual Encoder with Annealed Language Learning
Marc Marone
Orion Weller
William Fleshman
Eugene Yang
Dawn J Lawrie
Benjamin Van Durme
08 Sep 2025
Assessing the Role of Data Quality in Training Bilingual Language Models
Skyler Seto
Maartje ter Hoeve
Maureen de Seyssel
David Grangier
15 Jun 2025
Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation
Thomas F Burns
Letitia Parcalabescu
Stephan Wäldchen
Michael Barlow
Gregor Ziegltrum
Volker Stampa
Bastian Harren
Björn Deiseroth
SyDa
24 Apr 2025