The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition

20 May 2025
Ming Gao
Shilong Wu
Hang Chen
Jun Du
Chin-Hui Lee
Shinji Watanabe
Jingdong Chen
Sabato Marco Siniscalchi
Odette Scharenborg
Abstract

Meetings are a valuable yet challenging scenario for speech applications due to complex acoustic conditions. This paper summarizes the outcomes of the MISP 2025 Challenge, hosted at Interspeech 2025, which focuses on multi-modal, multi-device meeting transcription by incorporating video modality alongside audio. The tasks include Audio-Visual Speaker Diarization (AVSD), Audio-Visual Speech Recognition (AVSR), and Audio-Visual Diarization and Recognition (AVDR). We present the challenge's objectives, tasks, dataset, baseline systems, and solutions proposed by participants. The best-performing systems achieved significant improvements over the baseline: the top AVSD model achieved a Diarization Error Rate (DER) of 8.09%, improving by 7.43%; the top AVSR system achieved a Character Error Rate (CER) of 9.48%, improving by 10.62%; and the best AVDR system achieved a concatenated minimum-permutation Character Error Rate (cpCER) of 11.56%, improving by 72.49%.
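The Character Error Rate reported for the AVSR and AVDR tracks is the character-level Levenshtein (edit) distance between hypothesis and reference, divided by the reference length. The sketch below illustrates that computation; it is a minimal illustration only, not the challenge's official scoring pipeline, and the function name `cer` is our own.

```python
def cer(reference: str, hypothesis: str) -> float:
    """CER = (substitutions + deletions + insertions) / len(reference)."""
    ref, hyp = list(reference), list(hypothesis)
    # Classic dynamic-programming edit distance over characters.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference characters
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis characters
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match/substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(cer("hello", "hxllo"))  # one substitution over five characters -> 0.2
```

The cpCER used for AVDR additionally searches over permutations of speaker assignments before concatenating and scoring, so diarization errors propagate into the recognition score.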

@article{gao2025_2505.13971,
  title={The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition},
  author={Ming Gao and Shilong Wu and Hang Chen and Jun Du and Chin-Hui Lee and Shinji Watanabe and Jingdong Chen and Sabato Marco Siniscalchi and Odette Scharenborg},
  journal={arXiv preprint arXiv:2505.13971},
  year={2025}
}