ResearchTrend.AI
mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition

3 February 2025
Andrew Rouditchenko
Saurabhchand Bhati
Samuel Thomas
Hilde Kuehne
Rogerio Feris
Abstract

Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR, which combines the strengths of a pre-trained audio model (Whisper) and a pre-trained video model (AV-HuBERT). To enable better multi-modal integration and improve noisy multilingual performance, we introduce decoder modality dropout, in which the model is trained both on paired audio-visual inputs and on separate audio-only/visual-only inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
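The decoder modality dropout described above can be sketched roughly as follows. This is a simplified, hypothetical illustration only: the function name, the dropout probability, and the choice of zero-masking as the "dropped" representation are assumptions for illustration, not the authors' implementation.

```python
import random


def modality_dropout(audio_feats, video_feats, p_drop=0.25, rng=random):
    """Hypothetical sketch of modality dropout for audio-visual training.

    With probability p_drop the audio stream is dropped, with probability
    p_drop the video stream is dropped, and otherwise both streams are kept,
    so the model sees paired audio-visual inputs as well as single-modality
    inputs during training. Dropped features are replaced with zeros so the
    fused input shape stays constant.
    """
    r = rng.random()
    if r < p_drop:
        # Drop audio: the decoder must rely on the visual stream.
        audio_feats = [0.0] * len(audio_feats)
    elif r < 2 * p_drop:
        # Drop video: the decoder must rely on the audio stream.
        video_feats = [0.0] * len(video_feats)
    return audio_feats, video_feats
```

In a real system the features would be tensors and the masking would happen per training example inside the data pipeline or the forward pass; the sketch only shows the branching logic.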

@article{rouditchenko2025_2502.01547,
  title={mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition},
  author={Andrew Rouditchenko and Samuel Thomas and Hilde Kuehne and Rogerio Feris and James Glass},
  journal={arXiv preprint arXiv:2502.01547},
  year={2025}
}