Many state-of-the-art systems for audio tagging and sound event detection employ convolutional recurrent neural architectures. Typically, they are trained in a mean-teacher setting to deal with the heterogeneous annotation of the available data. In this work, we present a thorough analysis of how changing the temporal resolution of these convolutional recurrent neural networks, which can be done by simply adapting their pooling operations, impacts their performance. Using a variety of evaluation metrics, we investigate the effects of adapting this design parameter under several sound recognition scenarios with different requirements for temporal localization.
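The link between pooling and temporal resolution can be illustrated with a small sketch. It assumes non-overlapping pooling without padding along the time axis, so each layer divides the frame count by its pooling factor (rounded down). The function names, the 500-frame input, the 0.02 s hop, and the pooling configurations are hypothetical values chosen for illustration; they are not taken from the systems analysed in this work.

```python
from math import prod


def output_frames(n_input_frames, time_pool_factors):
    """Frames remaining after applying each time-pooling layer in turn.

    Assumes non-overlapping pooling without padding, so every layer
    floor-divides the current frame count by its pooling factor.
    """
    frames = n_input_frames
    for factor in time_pool_factors:
        frames //= factor
    return frames


def output_frame_duration(hop_seconds, time_pool_factors):
    """Time span covered by one output frame, in seconds."""
    return hop_seconds * prod(time_pool_factors)


# Hypothetical setup: a 10 s clip analysed with a 0.02 s hop gives 500 frames.
N_FRAMES = 500
HOP_SECONDS = 0.02

# Hypothetical pooling configurations (one time-pooling factor per conv block).
configs = {
    "no time pooling": [1, 1, 1],
    "moderate pooling": [2, 2, 1],
    "strong pooling": [2, 2, 2],
}

for name, factors in configs.items():
    frames = output_frames(N_FRAMES, factors)
    duration = output_frame_duration(HOP_SECONDS, factors)
    print(f"{name}: {frames} output frames, {duration:.2f} s per frame")
```

Under these assumptions, larger pooling factors give fewer and coarser output frames. That may be enough for clip-level tagging, but it limits how precisely event onsets and offsets can be placed.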