Sound event detection (SED) has gained increasing attention owing to its wide range of applications such as surveillance and video indexing. Existing SED models mainly generate frame-level predictions, turning the task into a sequential multi-label classification problem. This inevitably introduces a trade-off between event boundary detection and audio tagging when the model is trained with weakly labeled data; moreover, such models require post-processing and cannot be trained end-to-end. This paper first presents the one-dimensional Detection Transformer (1D-DETR), inspired by the Detection Transformer (DETR). Furthermore, given the characteristics of SED, an audio query and a one-to-many matching strategy for fine-tuning are added to 1D-DETR, forming the Sound Event Detection Transformer (SEDT), which generates event-level predictions and performs detection end-to-end. Experiments are conducted on the URBAN-SED dataset and the DCASE2019 Task 4 dataset, and both achieve competitive results compared with SOTA models. The application of SEDT to SED shows that it can serve as a framework for one-dimensional signal detection and may be extended to other similar tasks.
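To make the idea of event-level prediction concrete, the sketch below shows a minimal DETR-style decoder head for one-dimensional (audio) detection: a fixed set of event queries attends to encoded audio features and each query directly predicts a class distribution (including a "no event" class) and a normalized time span. This is an illustrative assumption of how such a head could look, not the authors' SEDT implementation; all module names, dimensions, and the (center, width) span parameterization are hypothetical.

```python
# Illustrative sketch only: a minimal DETR-style head for 1-D (audio) event
# detection. NOT the authors' SEDT code; sizes and parameterization are assumed.
import torch
import torch.nn as nn


class Toy1DDetectionHead(nn.Module):
    """Maps N event queries to event-level predictions: class logits
    (with an extra 'no event' class) and a normalized (center, width)
    interval on the time axis."""

    def __init__(self, d_model=256, num_classes=10, num_queries=20):
        super().__init__()
        self.query_embed = nn.Embedding(num_queries, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=3,
        )
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1: "no event"
        self.span_head = nn.Linear(d_model, 2)                 # (center, width)

    def forward(self, encoder_memory):
        # encoder_memory: (batch, time_steps, d_model) encoded audio features
        b = encoder_memory.size(0)
        queries = self.query_embed.weight.unsqueeze(0).expand(b, -1, -1)
        hs = self.decoder(queries, encoder_memory)
        return {
            "logits": self.class_head(hs),          # (batch, N, num_classes + 1)
            "spans": self.span_head(hs).sigmoid(),  # (batch, N, 2) in [0, 1]
        }


if __name__ == "__main__":
    head = Toy1DDetectionHead()
    memory = torch.randn(2, 500, 256)  # e.g. 500 encoded audio frames
    out = head(memory)
    print(out["logits"].shape, out["spans"].shape)
```

Because each query emits a complete event hypothesis (class plus time span), no frame-wise thresholding or median-filter post-processing is needed, which is the property the abstract refers to as end-to-end, event-level detection.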