ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2303.12187
17
0

Practice of the conformer enhanced AUDIO-VISUAL HUBERT on Mandarin and English

28 February 2023
Xiaoming Ren
Chao Li
Shenjian Wang
Biao Li
ArXivPDFHTML
Abstract

Considering the bimodal nature of human speech perception, lips, and teeth movement has a pivotal role in automatic speech recognition. Benefiting from the correlated and noise-invariant visual information, audio-visual recognition systems enhance robustness in multiple scenarios. In previous work, audio-visual HuBERT appears to be the finest practice incorporating modality knowledge. This paper outlines a mixed methodology, named conformer enhanced AV-HuBERT, boosting the AV-HuBERT system's performance a step further. Compared with baseline AV-HuBERT, our method in the one-phase evaluation of clean and noisy conditions achieves 7% and 16% relative WER reduction on the English AVSR benchmark dataset LRS3. Furthermore, we establish a novel 1000h Mandarin AVSR dataset CSTS. On top of the baseline AV-HuBERT, we exceed the WeNet ASR system by 14% and 18% relatively on MISP and CMLR by pre-training with this dataset. The conformer-enhanced AV-HuBERT we proposed brings 7% on MISP and 6% CER reduction on CMLR, compared with the baseline AV-HuBERT system.

View on arXiv
Comments on this paper