Learning More with Less: Self-Supervised Approaches for Low-Resource Speech Emotion Recognition

Speech Emotion Recognition (SER) has seen significant progress with deep learning, yet remains challenging for Low-Resource Languages (LRLs) due to the scarcity of annotated data. In this work, we explore self-supervised learning to improve SER in low-resource settings. Specifically, we investigate contrastive learning (CL) and Bootstrap Your Own Latent (BYOL) as self-supervised approaches to enhance cross-lingual generalization. Our methods achieve notable F1 score improvements of 10.6% in Urdu, 15.2% in German, and 13.9% in Bangla, demonstrating their effectiveness in LRLs. Additionally, we analyze model behavior to provide insights into the key factors influencing performance across languages, and highlight challenges in low-resource SER. This work provides a foundation for developing more inclusive, explainable, and robust emotion recognition systems for underrepresented languages.
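The abstract does not specify the exact contrastive objective, so as a rough illustration only: a common choice for contrastive learning over paired embeddings (e.g., two augmented views of the same utterance) is an NT-Xent/SimCLR-style loss. The sketch below is an assumption about the general technique, not the paper's implementation, and uses plain NumPy for clarity.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) contrastive loss.

    z1, z2: (N, D) arrays of embeddings for two views of the same N samples.
    Positive pairs are (z1[i], z2[i]); all other in-batch pairs act as negatives.
    NOTE: illustrative sketch only -- the paper's actual loss may differ.
    """
    z = np.concatenate([z1, z2], axis=0)              # (2N, D) stacked views
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize rows
    sim = z @ z.T / temperature                       # scaled cosine similarities
    n = z1.shape[0]
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    # Each sample's positive partner sits n rows away: i <-> i + n.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Row-wise cross-entropy: -log softmax at the positive index.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)
    return loss.mean()
```

Intuitively, the loss is low when the two views of each sample agree more with each other than with any other sample in the batch, which is the property that drives label-free representation learning here.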
@article{gong2025_2506.02059,
  title={Learning More with Less: Self-Supervised Approaches for Low-Resource Speech Emotion Recognition},
  author={Ziwei Gong and Pengyuan Shi and Kaan Donbekci and Lin Ai and Run Chen and David Sasu and Zehui Wu and Julia Hirschberg},
  journal={arXiv preprint arXiv:2506.02059},
  year={2025}
}