Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription

Abstract

Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams on which ASR is then performed. Recently, TF-GridNet has shown impressive performance in speech separation in real reverberant conditions. Furthermore, a mixture encoder was proposed that leverages the mixed speech to mitigate the effect of separation artifacts. In this work, we extend the mixture encoder from a static two-speaker scenario to a natural meeting context featuring an arbitrary number of speakers and varying degrees of overlap. We further demonstrate its limits by integrating it with separators of varying strength, including TF-GridNet. Our experiments yield new state-of-the-art performance on LibriCSS using a single microphone. They show that TF-GridNet largely closes the gap between previous methods and oracle separation, independent of mixture encoding. We further investigate the remaining potential for improvement.

@article{vieting2025_2309.08454,
  title={Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription},
  author={Peter Vieting and Simon Berger and Thilo von Neumann and Christoph Boeddeker and Ralf Schlüter and Reinhold Haeb-Umbach},
  journal={arXiv preprint arXiv:2309.08454},
  year={2025}
}