Learning More with Less: Self-Supervised Approaches for Low-Resource Speech Emotion Recognition

Speech Emotion Recognition (SER) has seen significant progress with deep learning, yet remains challenging for Low-Resource Languages (LRLs) due to the scarcity of annotated data. In this work, we explore self-supervised learning to improve SER in low-resource settings. Specifically, we investigate contrastive learning (CL) and Bootstrap Your Own Latent (BYOL) as self-supervised approaches to enhance cross-lingual generalization. Our methods achieve notable F1 score improvements of 10.6% in Urdu, 15.2% in German, and 13.9% in Bangla, demonstrating their effectiveness in LRLs. Additionally, we analyze model behavior to provide insights into the key factors influencing performance across languages, and highlight challenges in low-resource SER. This work provides a foundation for developing more inclusive, explainable, and robust emotion recognition systems for underrepresented languages.
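The abstract does not specify the exact contrastive objective, so as a rough illustration only: a common choice for contrastive learning over paired embeddings (e.g., two augmented views of the same utterance) is an NT-Xent/SimCLR-style loss. The sketch below is an assumption about the general technique, not the paper's implementation, and uses plain NumPy for clarity.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) contrastive loss.

    z1, z2: (N, D) arrays of embeddings for two views of the same N samples.
    Positive pairs are (z1[i], z2[i]); all other in-batch pairs act as negatives.
    NOTE: illustrative sketch only -- the paper's actual loss may differ.
    """
    z = np.concatenate([z1, z2], axis=0)              # (2N, D) stacked views
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize rows
    sim = z @ z.T / temperature                       # scaled cosine similarities
    n = z1.shape[0]
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    # Each sample's positive partner sits n rows away: i <-> i + n.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Row-wise cross-entropy: -log softmax at the positive index.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)
    return loss.mean()
```

Intuitively, the loss is low when the two views of each sample agree more with each other than with any other sample in the batch, which is the property that drives label-free representation learning here.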
@article{gong2025_2506.02059,
  title={Learning More with Less: Self-Supervised Approaches for Low-Resource Speech Emotion Recognition},
  author={Ziwei Gong and Pengyuan Shi and Kaan Donbekci and Lin Ai and Run Chen and David Sasu and Zehui Wu and Julia Hirschberg},
  journal={arXiv preprint arXiv:2506.02059},
  year={2025}
}