An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

13 March 2025
Laurie Burchell
Ona de Gibert
Nikolay Arefyev
Mikko Aulamo
Marta Bañón
Pinzhen Chen
Mariia Fedorova
Liane Guillou
Barry Haddow
Jan Hajič
Jindřich Helcl
Erik Henriksson
Mateusz Klimaszewski
Ville Komulainen
Andrey Kutuzov
Joona Kytöniemi
Veronika Laippala
Petter Mæhlum
Bhavitvya Malik
Farrokh Mehryary
Vladislav Mikhailov
Nikita Moghe
Amanda Myntti
Dayyán O'Brien
Stephan Oepen
Proyag Pal
Jousia Piha
Sampo Pyysalo
Gema Ramírez-Sánchez
David Samuel
Pavel Stepachev
Jörg Tiedemann
Dušan Variš
Tereza Vojtěchová
Jaume Zaragoza-Bernabeu
Abstract

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.

@article{burchell2025_2503.10267,
  title={An Expanded Massive Multilingual Dataset for High-Performance Language Technologies},
  author={Laurie Burchell and Ona de Gibert and Nikolay Arefyev and Mikko Aulamo and Marta Bañón and Pinzhen Chen and Mariia Fedorova and Liane Guillou and Barry Haddow and Jan Hajič and Jindřich Helcl and Erik Henriksson and Mateusz Klimaszewski and Ville Komulainen and Andrey Kutuzov and Joona Kytöniemi and Veronika Laippala and Petter Mæhlum and Bhavitvya Malik and Farrokh Mehryary and Vladislav Mikhailov and Nikita Moghe and Amanda Myntti and Dayyán O'Brien and Stephan Oepen and Proyag Pal and Jousia Piha and Sampo Pyysalo and Gema Ramírez-Sánchez and David Samuel and Pavel Stepachev and Jörg Tiedemann and Dušan Variš and Tereza Vojtěchová and Jaume Zaragoza-Bernabeu},
  journal={arXiv preprint arXiv:2503.10267},
  year={2025}
}