Specialized Decision Surface and Disentangled Feature for
Weakly-Supervised Polyphonic Sound Event Detection
Sound event detection (SED) consists of recognizing the presence of sound events in an audio segment and detecting their onsets and offsets. In this paper, we focus on two common problems in SED: how to carry out efficient weakly-supervised learning and how to learn better from an unbalanced dataset in which multiple sound events often co-occur. We approach SED as a multiple instance learning (MIL) problem and solve it with a neural network framework equipped with different pooling modules. General MIL approaches fall into two categories: the instance-level approach and the embedding-level approach. Since the embedding-level approach tends to perform better than the instance-level approach in terms of bag-level classification but cannot provide instance-level probabilities, we present how to generate instance-level probabilities for it. Moreover, we propose a specialized decision surface (SDS) for embedding-level attention pooling. We analyze and explain why an embedding-level attention module with SDS is better than other typical pooling modules from the perspective of the high-level feature space. To address the unbalanced dataset and the co-occurrence of multiple categories in the polyphonic event detection task, we propose a disentangled feature (DF) to reduce interference among categories; it optimizes the high-level feature space by disentangling it based on class-wise identifiable information, yielding multiple distinct subspaces. Experiments on the dataset of DCASE 2018 Task 4 show that the proposed SDS and DF significantly improve the detection performance of the embedding-level MIL approach with an attention pooling module and outperform the first-place system in the challenge by 6.2 percentage points.
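To make the embedding-level MIL setting concrete, the sketch below illustrates one common form of embedding-level attention pooling: frame-level embeddings are aggregated into a per-class clip embedding with learned attention weights, and the clip-level (bag-level) probability is predicted from the pooled embedding. This is a minimal illustrative sketch, not the paper's exact architecture; the layer choices, module name, and hyperparameters (`embed_dim`, `num_classes`) are assumptions for the example.

```python
import torch
import torch.nn as nn


class EmbeddingLevelAttentionPooling(nn.Module):
    """Minimal sketch of embedding-level attention pooling for MIL-based SED."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.attn = nn.Linear(embed_dim, num_classes)        # per-class attention scores
        self.classifier = nn.Linear(embed_dim, num_classes)  # clip-level classifier

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, embed_dim) frame-level embeddings
        weights = torch.softmax(self.attn(frames), dim=1)     # (batch, time, num_classes)
        # Pool the frame embeddings per class using the attention weights.
        pooled = torch.einsum("btc,btd->bcd", weights, frames)  # (batch, num_classes, embed_dim)
        # Score each class from its own pooled embedding.
        logits = (self.classifier.weight * pooled).sum(-1) + self.classifier.bias
        return torch.sigmoid(logits)                           # clip-level (bag-level) probabilities


# Usage: a batch of 4 clips, 10 frames of 128-dim embeddings, 10 event classes.
model = EmbeddingLevelAttentionPooling(embed_dim=128, num_classes=10)
clip_probs = model(torch.randn(4, 10, 128))  # shape (4, 10)
```

Under this formulation the network is trained only with clip-level (weak) labels; the paper's contribution of deriving instance-level (frame-level) probabilities, the SDS, and the DF operate on top of this kind of pooling.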