Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese
arXiv:2205.10517, 21 May 2022
Kurt Micallef, Albert Gatt, Marc Tanti, Lonneke van der Plas, Claudia Borg
Papers citing "Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese" (3 of 3 shown)
Data Processing for the OpenGPT-X Model Family
Nicolo' Brandizzi, Hammam Abdelwahab, Anirban Bhowmick, Lennard Helmer, Benny Jörg Stein, ..., Georg Rehm, Dennis Wegener, Nicolas Flores-Herr, Joachim Kohler, Johannes Leveling
11 Oct 2024
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, Iryna Gurevych
31 Dec 2020
When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models
Benjamin Muller, Antonis Anastasopoulos, Benoît Sagot, Djamé Seddah
24 Oct 2020