ResearchTrend.AI
arXiv: 2502.10361
Enhancing Multilingual LLM Pretraining with Model-Based Data Selection


17 February 2025
Bettina Messmer
Vinko Sabolčec
Martin Jaggi

Papers citing "Enhancing Multilingual LLM Pretraining with Model-Based Data Selection"

6 / 6 papers shown
Tahakom LLM Guidelines and Recipes: From Pre-training Data to an Arabic LLM
Areej AlOtaibi
Lina Alyahya
Raghad Alshabanah
Shahad Alfawzan
Shuruq Alarefei
...
Waad Alahmed
Omar Talabay
Jalal Alowibdi
Salem Alelyani
Adel Bibi
15 Oct 2025
BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data
Jaap Jumelet
Abdellah Fourtassi
Akari Haga
Bastian Bunzeck
Bhargav Shandilya
...
Yurii Paniv
Ziyin Zhang
Arianna Bisazza
Alex Warstadt
Leshem Choshen
11 Oct 2025
Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World
Inés Altemir Marinas
Anastasiia Kucherenko
Alexander Sternfeld
Andrei Kucharavy
10 Oct 2025
mmBERT: A Modern Multilingual Encoder with Annealed Language Learning
Marc Marone
Orion Weller
William Fleshman
Eugene Yang
Dawn J Lawrie
Benjamin Van Durme
08 Sep 2025
Assessing the Role of Data Quality in Training Bilingual Language Models
Skyler Seto
Maartje ter Hoeve
Maureen de Seyssel
David Grangier
15 Jun 2025
Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation
Thomas F Burns
Letitia Parcalabescu
Stephan Wäldchen
Michael Barlow
Gregor Ziegltrum
Volker Stampa
Bastian Harren
Björn Deiseroth
24 Apr 2025