Grammatical Error Correction for Low-Resource Languages: The Case of Zarma

20 October 2024

Abstract

Grammatical error correction (GEC) aims to improve the quality and readability of texts through accurate correction of linguistic mistakes. Previous work has focused on high-resource languages, while low-resource languages lack robust tools. These languages often face problems such as non-standard orthography, limited annotated corpora, and diverse dialects, which slow down the development of GEC tools. We present a study on GEC for Zarma, a language spoken by over five million people in West Africa. We compare three approaches: rule-based methods, machine translation (MT) models, and large language models (LLMs). We evaluate them on a dataset of more than 250,000 examples, including synthetic and human-annotated data. Our results show that the MT-based approach using M2M100 outperforms the others, with a detection rate of 95.82% and a suggestion accuracy of 78.90% in automatic evaluations (AE), and an average score of 3.0 out of 5.0 from native speakers in manual evaluation (ME) of grammar and logical corrections. The rule-based method was effective for spelling errors but failed on complex context-level errors. LLMs, represented here by MT5-small, showed moderate performance. Our work supports the use of MT models to enhance GEC in low-resource settings, and we validated these results with Bambara, another West African language.
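The abstract reports a detection rate and a suggestion accuracy but does not define them. As a rough illustration only, the sketch below assumes that detection rate is the share of erroneous inputs the system changes at all, and that suggestion accuracy is the share whose output exactly matches the human reference. Both definitions, the `Example` class, and the function names are hypothetical and may differ from the paper's actual metrics; the sentences are abstract placeholders, not Zarma data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Example:
    source: str      # input sentence known to contain an error
    reference: str   # human-corrected sentence
    prediction: str  # system output


def detection_rate(examples: Sequence[Example]) -> float:
    """Percentage of erroneous inputs that the system modified (assumed definition)."""
    if not examples:
        raise ValueError("examples must not be empty")
    flagged = sum(1 for ex in examples if ex.prediction != ex.source)
    return 100.0 * flagged / len(examples)


def suggestion_accuracy(examples: Sequence[Example]) -> float:
    """Percentage of outputs that exactly match the reference (assumed definition)."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(1 for ex in examples if ex.prediction == ex.reference)
    return 100.0 * correct / len(examples)


# Placeholder data: four erroneous inputs, three changed, two corrected exactly.
toy = [
    Example(source="a b x", reference="a b c", prediction="a b c"),
    Example(source="d x f", reference="d e f", prediction="d e f"),
    Example(source="g h x", reference="g h i", prediction="g h y"),
    Example(source="j x l", reference="j k l", prediction="j x l"),
]
print(detection_rate(toy), suggestion_accuracy(toy))
```

Under these assumed definitions, an error can be detected without being fixed, which is why the two figures can diverge as they do in the reported results.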


@article{keita2025_2410.15539,
  title={Grammatical Error Correction for Low-Resource Languages: The Case of Zarma},
  author={Mamadou K. Keita and Christopher Homan and Marcos Zampieri and Adwoa Bremang and Habibatou Abdoulaye Alfari and Elysabhete Amadou Ibrahim and Dennis Owusu},
  journal={arXiv preprint arXiv:2410.15539},
  year={2025}
}
