Sound event detection (SED) is the task of detecting the presence or absence of audio events, together with their onset and offset times, within a given audio clip. While SED can be approached with supervised machine learning, where the training data is fully labeled with per-event timestamps and durations, our work focuses on weakly-supervised sound event detection (WSSED), where prior knowledge about an event's duration is unavailable. Recent research in the field focuses on improving segment- and event-level localization performance for specific datasets with respect to specific evaluation metrics. In particular, well-performing event-level localization requires fully labeled development subsets to obtain event duration estimates, which significantly benefit localization performance. Moreover, well-performing segment-level localization models output predictions at a coarse scale (e.g., 1 second), hindering their deployment on datasets containing very short events (< 1 second). This work proposes a duration-robust CRNN (CDur) framework, which aims to achieve competitive performance in terms of both segment- and event-level localization. We further propose a new post-processing strategy named "Triple Threshold" and investigate two data augmentation methods along with a label smoothing method within the scope of WSSED. Our model is evaluated on the DCASE2017 and DCASE2018 Task 4 datasets and on URBAN-SED. It outperforms other approaches on the DCASE2018 and URBAN-SED datasets without requiring prior duration knowledge. In particular, our model achieves performance similar to that of strongly-labeled supervised models on the URBAN-SED dataset. Lastly, ablation experiments reveal that, without post-processing, our model's drop in localization performance is significantly smaller than that of other approaches.
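To make the localization task concrete, the following minimal sketch shows one generic way frame-level probabilities for a single event class could be converted into (onset, offset) intervals. It is an illustration only and is not the "Triple Threshold" strategy proposed in this work: the function name `frames_to_events`, the single fixed threshold, and the `hop_seconds` frame resolution are all assumptions introduced here.

```python
from typing import List, Sequence, Tuple


def frames_to_events(
    probs: Sequence[float],
    threshold: float = 0.5,    # assumed single decision threshold
    hop_seconds: float = 0.02  # assumed duration of one output frame
) -> List[Tuple[float, float]]:
    """Turn per-frame probabilities of one class into (onset, offset) pairs.

    A frame counts as active when its probability exceeds `threshold`;
    consecutive active frames are merged into a single event.
    """
    events: List[Tuple[float, float]] = []
    onset = None  # index of the first frame of the currently open event
    for i, p in enumerate(probs):
        active = p > threshold
        if active and onset is None:
            onset = i  # an event starts at this frame
        elif not active and onset is not None:
            # the event ended at the previous frame; close it
            events.append((onset * hop_seconds, i * hop_seconds))
            onset = None
    if onset is not None:
        # the event is still active at the end of the clip
        events.append((onset * hop_seconds, len(probs) * hop_seconds))
    return events
```

With a coarse frame resolution such as `hop_seconds=1.0`, any event shorter than one frame cannot be localized more precisely than that frame, which is the limitation of coarse-scale segment-level models noted above.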