arXiv: 2502.10361

Enhancing Multilingual LLM Pretraining with Model-Based Data Selection

17 February 2025

Bettina Messmer

Vinko Sabolčec

Papers citing "Enhancing Multilingual LLM Pretraining with Model-Based Data Selection"

Tahakom LLM Guidelines and Recipes: From Pre-training Data to an Arabic LLM

Raghad Alshabanah

Shahad Alfawzan

Shuruq Alarefei

...

193

0

0

15 Oct 2025

BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data

Abdellah Fourtassi

Bastian Bunzeck

Bhargav Shandilya

...

Arianna Bisazza

116

1

0

11 Oct 2025

Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World

Inés Altemir Marinas

Anastasiia Kucherenko

Alexander Sternfeld

Andrei Kucharavy

113

0

0

10 Oct 2025

mmBERT: A Modern Multilingual Encoder with Annealed Language Learning

William Fleshman

Benjamin Van Durme

192

10

0

08 Sep 2025

Assessing the Role of Data Quality in Training Bilingual Language Models

Maartje ter Hoeve

Maureen de Seyssel

159

0

0

15 Jun 2025

Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation

Letitia Parcalabescu

Stephan Wäldchen

Gregor Ziegltrum

Björn Deiseroth

499

1

0

24 Apr 2025