Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario

Abstract

In this paper, we reformulate overlapped speech diarization, which previous studies have treated as a multi-label classification task, as a single-label prediction problem. Specifically, the multiple labels of each frame are encoded into a single label using the power set, which represents the possible combinations of different speakers. Building on this formulation, we propose the speaker embedding-aware neural diarization (SEND) system. In SEND, the speech encoder, speaker encoder, similarity scores, and post-processing network are optimized to predict the power-set-encoded labels according to the similarities between speech features and speaker embeddings. Experimental results show that our method significantly outperforms the variational Bayesian hidden Markov model-based clustering algorithm (VBx). In addition, the proposed method has two benefits compared with target-speaker voice activity detection (TSVAD). First, SEND achieves lower diarization error rates in the real meeting scenario. Second, when the training data has a high overlap ratio, the learning process of SEND is more stable than that of TSVAD.
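
The power set encoding mentioned in the abstract can be illustrated with a minimal Python sketch. It maps each frame's multi-hot speaker activity vector to a single class index, one index per speaker combination. The function names, the ordering of the combinations, and the optional `max_overlap` cap are assumptions made for illustration; the abstract does not specify how the labels are enumerated.

```python
from itertools import combinations


def build_powerset_labels(num_speakers, max_overlap=None):
    """Map every speaker combination to one class index.

    Hypothetical helper: combinations are ordered by size, then
    lexicographically. `max_overlap` (an assumption, not from the abstract)
    optionally limits how many speakers may be active at once.
    """
    if max_overlap is None:
        max_overlap = num_speakers
    combos = []
    for size in range(max_overlap + 1):
        combos.extend(combinations(range(num_speakers), size))
    return {frozenset(combo): index for index, combo in enumerate(combos)}


def encode_frames(multi_hot_frames, label_map):
    """Turn per-frame multi-hot activity vectors into single labels."""
    labels = []
    for frame in multi_hot_frames:
        active = frozenset(i for i, flag in enumerate(frame) if flag)
        labels.append(label_map[active])
    return labels


def decode_labels(labels, label_map, num_speakers):
    """Recover the multi-hot activity vectors from single labels."""
    inverse = {index: combo for combo, index in label_map.items()}
    return [
        [1 if speaker in inverse[label] else 0 for speaker in range(num_speakers)]
        for label in labels
    ]


if __name__ == "__main__":
    label_map = build_powerset_labels(num_speakers=3)
    frames = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
    labels = encode_frames(frames, label_map)
    print(labels)
    print(decode_labels(labels, label_map, num_speakers=3) == frames)
```

With this encoding a frame-level classifier needs only one softmax output per frame, so overlapped speech becomes an ordinary class rather than several simultaneously active binary outputs. The number of classes grows as 2^N for N speakers unless the overlap is capped, which is why the sketch exposes the hypothetical `max_overlap` argument.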
