Targeted Subset Selection for Limited-data ASR Accent Adaptation

We study the task of adapting an existing ASR model to a non-native accent under a transcription budget, expressed as the total duration of utterances that may be selected from a large unlabeled corpus. We propose a subset selection approach based on the recently proposed submodular mutual information functions, which identifies a diverse set of utterances that match the target accent. The target accent is specified through a few target utterances, and the functions model the relationship between this target set and the selected subset. The selected utterances are transcribed and used to fine-tune the model, which adapts it to the accent. We also use an accent classifier to learn accent-aware feature representations. Our method can additionally exploit samples from other accents to perform out-of-domain selection for low-resource accents that are not available in these corpora. We show that targeted subset selection improves significantly upon random sampling, by around 5% to 10% (absolute) in most cases, and is around 10x more label-efficient. We also compare with an oracle method that picks specifically from the target accent; our method is comparable to the oracle in both its selections and its WER performance.
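To make the selection step concrete, the following is a minimal sketch of budgeted greedy selection, not the authors' implementation. It assumes a facility-location-style mutual information objective, which is one member of the submodular mutual information family; the abstract does not say which function is used. It also assumes cosine similarity over fixed-length utterance features (standing in for the accent-aware representations) and a greedy rule that ranks candidates by marginal gain per second of audio. The names `cosine_similarity`, `fl_mutual_information`, `greedy_select`, and the parameter `eta` are hypothetical.

```python
import numpy as np


def cosine_similarity(pool, query):
    """Non-negative cosine similarity, shape (n_pool, n_query)."""
    pool = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    query = query / np.linalg.norm(query, axis=1, keepdims=True)
    # Clipping at zero keeps the objective below monotone (an assumption).
    return np.clip(pool @ query.T, 0.0, None)


def fl_mutual_information(selected, sim, eta=1.0):
    """Facility-location-style mutual information between A and the query set.

    Assumed form: sum over queries of their best match in A, plus
    eta times the sum over A of each item's best match among the queries.
    """
    if not selected:
        return 0.0
    sub = sim[selected]                      # (|A|, n_query)
    query_coverage = sub.max(axis=0).sum()   # saturates once queries are covered
    relevance = sub.max(axis=1).sum()        # rewards items close to the target
    return float(query_coverage + eta * relevance)


def greedy_select(sim, durations, budget, eta=1.0):
    """Greedily add the utterance with the best gain per second within budget."""
    selected, spent, current = [], 0.0, 0.0
    remaining = set(range(sim.shape[0]))
    while remaining:
        best, best_ratio, best_value = None, 0.0, current
        for i in sorted(remaining):
            if spent + durations[i] > budget:
                continue  # transcribing this utterance would exceed the budget
            value = fl_mutual_information(selected + [i], sim, eta)
            ratio = (value - current) / durations[i]
            if ratio > best_ratio:
                best, best_ratio, best_value = i, ratio, value
        if best is None:
            break  # nothing affordable adds value
        selected.append(best)
        spent += durations[best]
        current = best_value
        remaining.remove(best)
    return selected, spent
```

In this sketch, `sim` would be built as `cosine_similarity(pool_features, target_features)`, where `target_features` holds the few target-accent utterances, and `durations` and `budget` are in the same time unit. The selected indices would then be sent for transcription and used for fine-tuning.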