ResearchTrend.AI
Methods to Increase the Amount of Data for Speech Recognition for Low Resource Languages

8 January 2025
Alexan Ayrapetyan
Sofia Kostandian
Ara Yeroyan
Mher Yerznkanyan
Nikolay Karpov
Nune Tadevosyan
Vitaly Lavrukhin
Boris Ginsburg
Abstract

This study explores methods to increase data volume for low-resource languages using techniques such as crowdsourcing, pseudo-labeling, advanced data preprocessing, and permissively licensed data sources such as audiobooks, Common Voice, and YouTube. While these methods are well explored for high-resource languages, their application to low-resource languages remains underexplored. Using Armenian and Georgian as case studies, we demonstrate how linguistic and resource-specific characteristics influence the success of these methods. This work provides practical guidance for researchers choosing cost-effective, quality-driven dataset extension strategies for low-resource languages. The key takeaway from the various data extension approaches is that paid crowdsourcing offers the best balance between cost and quality, outperforming volunteer crowdsourcing, open-source audiobooks, and unlabeled data usage. An ablation study shows that models trained on the expanded datasets outperform existing baselines, achieving word error rates of 5.73% for Georgian and 9.9% for Armenian using a relatively small FastConformer architecture. We open-sourced both the Armenian and Georgian models to allow further research and practical applications.
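For readers unfamiliar with the metric reported above, word error rate (WER) is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal sketch (not the authors' evaluation code, which is not shown in the abstract):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of 6 reference words, so WER = 1/6 ≈ 0.167
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 5.73% thus means roughly 6 word-level errors (substitutions, insertions, or deletions) per 100 reference words.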

@article{ayrapetyan2025_2501.14788,
  title={Methods to Increase the Amount of Data for Speech Recognition for Low Resource Languages},
  author={Alexan Ayrapetyan and Sofia Kostandian and Ara Yeroyan and Mher Yerznkanyan and Nikolay Karpov and Nune Tadevosyan and Vitaly Lavrukhin and Boris Ginsburg},
  journal={arXiv preprint arXiv:2501.14788},
  year={2025}
}