Sound event detection (SED) has gained increasing attention owing to its wide range of applications, such as surveillance and video indexing. Existing SED models mainly generate frame-level predictions, converting the task into a sequence of multi-label classification problems. A critical issue with such frame-based models is that they pursue the best frame-level prediction rather than the best event-level prediction. Moreover, they require post-processing and cannot be trained in an end-to-end manner. This paper first presents the one-dimensional Detection Transformer (1D-DETR), inspired by the Detection Transformer (DETR) for image object detection. Then, given the characteristics of SED, an audio query branch and a one-to-many matching strategy for fine-tuning are added to 1D-DETR to form the Sound Event Detection Transformer (SEDT). To our knowledge, SEDT is the first event-based and end-to-end SED model. Experiments are conducted on the URBAN-SED dataset and the DCASE2019 Task 4 dataset, and both show that SEDT achieves competitive performance.
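To make the contrast between frame-based and event-based prediction concrete, the following minimal sketch (not from the paper; all names and parameters are illustrative assumptions) shows the kind of thresholding-and-merging post-processing a frame-based model needs to turn per-frame probabilities into (class, onset, offset) events, a step that an event-based model such as SEDT is meant to avoid by decoding event spans directly.

```python
import numpy as np

# Hypothetical frame-level output of a frame-based SED model:
# a (T, C) array of per-frame, per-class probabilities.
frame_probs = np.random.rand(500, 10)  # 500 frames, 10 sound event classes

def frames_to_events(probs, threshold=0.5, hop_sec=0.02):
    """Post-processing required by a frame-based model: threshold each
    class track, then merge consecutive active frames into
    (class, onset_sec, offset_sec) events."""
    events = []
    active = probs > threshold
    num_frames, num_classes = probs.shape
    for c in range(num_classes):
        onset = None
        for t in range(num_frames):
            if active[t, c] and onset is None:
                onset = t
            elif not active[t, c] and onset is not None:
                events.append((c, onset * hop_sec, t * hop_sec))
                onset = None
        if onset is not None:  # event still active at the end of the clip
            events.append((c, onset * hop_sec, num_frames * hop_sec))
    return events

print(frames_to_events(frame_probs)[:5])
```

By contrast, a DETR-style event-based model decodes a fixed set of event queries, each yielding a class label and a time span, so no thresholding or frame-merging step is needed and the model can be trained end-to-end against event-level targets.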