The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages

Annual Meeting of the Association for Computational Linguistics (ACL), 2025
21 February 2025
Jenalea Rajab
Anuoluwapo Aremu
Everlyn Asiko Chimoto
Dale Dunbar
Graham Morrissey
Fadel Thior
Luandrie Potgieter
Jessico Ojo
A. Tonja
Maushami Chetty
Onyothi Nekoto
Pelonomi Moiloa
Jade Z. Abbott
Vukosi Marivate
Benjamin Rosman
arXiv: 2502.15916
Main: 9 pages
4 figures
Bibliography: 2 pages
5 tables
Appendix: 3 pages
Abstract

This paper presents the Esethu Framework, a sustainable data curation framework designed to empower local communities and ensure equitable benefit-sharing from their linguistic resources. The framework is supported by the Esethu license, a novel community-centric data license. As a proof of concept, we introduce the Vukúzenzele isiXhosa Speech Dataset (ViXSD), an open-source corpus developed under the Esethu Framework and License. The dataset contains read speech from native isiXhosa speakers, enriched with demographic and linguistic metadata. It demonstrates how community-driven licensing and curation principles can bridge resource gaps in automatic speech recognition (ASR) for African languages while safeguarding the interests of data creators. We describe the framework guiding dataset development, outline the provisions of the Esethu license, present the methodology for ViXSD, and report ASR experiments validating ViXSD's usability in building and refining voice-driven applications for isiXhosa.
