Sound Event Detection Transformer: An Event-based End-to-End Model for Sound Event Detection

Abstract

Sound event detection (SED) has gained increasing attention owing to its wide application in areas such as surveillance and video indexing. Existing SED models mainly generate frame-level predictions, casting the task as a sequence multi-label classification problem. A critical issue with frame-based models is that they pursue the best frame-level prediction rather than the best event-level prediction. In addition, they require post-processing and cannot be trained end to end. This paper first presents the one-dimensional Detection Transformer (1D-DETR), inspired by the Detection Transformer for image object detection. Furthermore, given the characteristics of SED, an audio query branch and a one-to-many matching strategy for fine-tuning the model are added to 1D-DETR to form the Sound Event Detection Transformer (SEDT). To our knowledge, SEDT is the first event-based and end-to-end SED model. Experiments on the URBAN-SED dataset and the DCASE2019 Task4 dataset both show that SEDT can achieve competitive performance.
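
To illustrate the post-processing that frame-based models depend on, the following is a minimal sketch of one common approach: thresholding per-frame probabilities for a single class and merging consecutive active frames into (onset, offset) events. It is not taken from the paper; the function name `frames_to_events`, the threshold of 0.5, and the hop size of 0.02 s are assumptions chosen for illustration.

```python
from typing import List, Tuple


def frames_to_events(
    probs: List[float],
    threshold: float = 0.5,     # assumed decision threshold
    hop_seconds: float = 0.02,  # assumed duration of one frame
) -> List[Tuple[float, float]]:
    """Convert per-frame probabilities for one class into (onset, offset) pairs."""
    events: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the currently open event
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = i  # an event begins at this frame
        elif not active and start is not None:
            # the event ends just before this inactive frame
            events.append((start * hop_seconds, i * hop_seconds))
            start = None
    if start is not None:
        # close an event that is still active at the end of the clip
        events.append((start * hop_seconds, len(probs) * hop_seconds))
    return events
```

An event-based model, by contrast, would output the event boundaries and class directly, so a step like this is not needed.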
