This technical report describes the systems we submitted for subtask 4 of the DCASE 2021 challenge, which addresses sound event detection. The models are closely related to the baseline provided for this task: they are essentially convolutional recurrent neural networks trained in a mean teacher setting to cope with the heterogeneous annotation of the supplied data. However, the systems are evaluated using two intersection-based metrics with different requirements on temporal localization, so we adapted the time resolution of the predictions to each metric by optimizing the pooling operations. For the first evaluation scenario, which imposes relatively strict requirements on temporal localization accuracy, our best model achieved a PSDS of 0.3609 on the validation data. This is only marginally better than the baseline system (0.342), because the amount of pooling in the baseline network already turned out to be optimal and no substantial changes were made. For the second evaluation scenario, which imposes relatively lax requirements on localization accuracy, our best-performing system achieved a PSDS of 0.7312 on the validation data. This is considerably better than the baseline model (0.527), an improvement that can be attributed to the changes applied to the pooling operations of the network.
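To make the link between pooling and time resolution concrete, the following minimal sketch computes how coarse the frame-level predictions become for a given set of temporal pooling factors. The sample rate, hop length, frame count, and pooling factors used here are illustrative assumptions, not values stated in this report, and the helper names are hypothetical.

```python
from math import prod


def output_resolution(sample_rate, hop_length, time_pooling):
    """Return (seconds per output frame, total temporal pooling factor).

    time_pooling lists the pooling factor applied along the time axis
    in each convolutional block (assumed values, for illustration only).
    """
    total = prod(time_pooling)
    return hop_length * total / sample_rate, total


def n_output_frames(n_input_frames, time_pooling):
    """Count prediction frames left after all temporal pooling steps."""
    frames = n_input_frames
    for factor in time_pooling:
        # Floor division assumes non-overlapping pooling without padding.
        frames //= factor
    return frames


if __name__ == "__main__":
    # Assumed front end: 16 kHz audio, hop of 256 samples, 626 input frames.
    fine = [2, 2, 1, 1, 1, 1, 1]    # hypothetical: little temporal pooling
    coarse = [2, 2, 2, 2, 2, 1, 1]  # hypothetical: heavy temporal pooling
    for name, pooling in (("fine", fine), ("coarse", coarse)):
        seconds, total = output_resolution(16000, 256, pooling)
        frames = n_output_frames(626, pooling)
        print(f"{name}: x{total} pooling, {seconds:.3f} s per frame, {frames} frames")
```

Under these assumptions, less temporal pooling keeps the predictions fine-grained, which suits a metric with strict localization requirements, while more pooling yields coarser predictions that may be sufficient when the localization requirements are lax.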