Whisper-MCE: Whisper Model Finetuned for Better Performance with Mixed Languages

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
ZiWei Chen
Kani Chen
Yang Wang
Main: 4 pages · 3 figures · Bibliography: 1 page
Abstract

Whisper has recently approached human-level robustness and accuracy in English automatic speech recognition (ASR), but in minor-language and mixed-language speech recognition there remains a compelling need for improvement. In this work, we present Whisper-MCE, a Whisper model finetuned on our self-collected Mixed Cantonese and English (MCE) audio dataset. Because word error rate (WER) poses challenges for evaluating effectiveness in minor-language and mixed-language contexts, we also present a novel rating mechanism. Compared with the baseline whisper-large-v2 model, Whisper-MCE more accurately captures the content of the original audio, achieves higher recognition accuracy, and recognizes speech faster. Notably, it outperforms other existing models on the specific task of mixed-language recognition.
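To see why WER can mislead in mixed-language settings, it helps to have the metric itself in hand. The sketch below is a standard word-level Levenshtein implementation, not the paper's code, and the example strings are illustrative rather than drawn from MCE:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.333
```

Note that the `split()` tokenization already exposes one of the problems the abstract alludes to: written Cantonese has no whitespace word boundaries, so "word"-level WER is ill-defined for code-switched Cantonese–English transcripts without an extra segmentation step.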
