SMILE: Speech Meta In-Context Learning for Low-Resource Language Automatic Speech Recognition

16 September 2024

Ming-Hao Hsu

Kuan Po Huang

Main: 6 pages

5 figures

Bibliography: 1 page

12 tables

Appendix: 8 pages

Abstract

Automatic Speech Recognition (ASR) models demonstrate outstanding performance on high-resource languages but face significant challenges when applied to low-resource languages, owing to limited training data and insufficient cross-lingual generalization. Existing adaptation strategies, such as shallow fusion, data augmentation, and direct fine-tuning, either rely on external resources, suffer from computational inefficiency, or fail in test-time adaptation scenarios. To address these limitations, we introduce Speech Meta In-Context LEarning (SMILE), a framework that combines meta-learning with speech in-context learning (SICL). SMILE leverages meta-training on high-resource languages to enable robust few-shot generalization to low-resource languages without explicit fine-tuning on the target domain. Extensive experiments on the ML-SUPERB benchmark show that SMILE consistently outperforms baseline methods, significantly reducing character and word error rates in training-free few-shot multilingual ASR tasks.
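
The abstract reports results as character and word error rates. For readers unfamiliar with these metrics, the following is a minimal sketch of how they are conventionally computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length. The function names (`edit_distance`, `error_rate`, `wer`, `cer`) are illustrative assumptions and are not taken from the paper or from the ML-SUPERB tooling, which may apply additional text normalization before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the number of reference units."""
    if len(ref_units) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    """Word error rate: units are whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: units are individual characters."""
    return error_rate(list(reference), list(hypothesis))
```

Character error rate is usually the more informative of the two for languages without whitespace word boundaries, since word error rate depends on how the text is tokenized.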
