Whisper-MCE: Whisper Model Finetuned for Better Performance with Mixed Languages

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
ZiWei Chen
Kani Chen
Yang Wang
Main: 4 pages · 3 figures · Bibliography: 1 page
Abstract

Whisper has recently approached human-level robustness and accuracy in English automatic speech recognition (ASR), but in minor-language and mixed-language speech recognition there remains a compelling need for improvement. In this work, we present Whisper-MCE, a Whisper model finetuned on our self-collected Mixed Cantonese and English (MCE) audio dataset. Because word error rate (WER) poses challenges for evaluating effectiveness in minor-language and mixed-language contexts, we also present a novel rating mechanism. Compared with the baseline whisper-large-v2 model, Whisper-MCE more accurately captures the content of the original audio, achieves higher recognition accuracy, and recognizes speech faster. Notably, it outperforms other existing models on the specific task of mixed-language recognition.
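To see why WER can mislead in mixed-language settings, it helps to have the metric itself in hand. The sketch below is a standard word-level Levenshtein implementation, not the paper's code, and the example strings are illustrative rather than drawn from MCE:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.333
```

Note that the `split()` tokenization already exposes one of the problems the abstract alludes to: written Cantonese has no whitespace word boundaries, so "word"-level WER is ill-defined for code-switched Cantonese–English transcripts without an extra segmentation step.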
