Acoustic scene classification (ASC) has been approached in recent years using deep learning techniques such as convolutional neural networks or recurrent neural networks. Many state-of-the-art solutions are based on image classification frameworks and therefore train on a 2D representation of the audio signal. Finding the most suitable audio representation remains an open research question. In this paper, different log-Mel representations and their combinations are analyzed. Experiments show that the best results are obtained using the harmonic and percussive components plus the difference between the left and right stereo channels (L-R). In addition, ensembling different models is a common strategy for increasing the final accuracy. Even though averaging the predictions of different models is a common choice, an exhaustive analysis of ensemble techniques has not yet been presented for ASC problems. In this paper, the geometric mean, the arithmetic mean, and the Ordered Weighted Averaging (OWA) operator are studied as aggregation operators for the outputs of the models in the ensemble. Finally, the work carried out in this paper is strongly oriented towards real-time implementations. In this context, as the number of applications for audio classification on edge devices is growing rapidly, we also analyze different network depths and efficient solutions for aggregating ensemble predictions.
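As a rough illustration of the three aggregation operators named in the abstract, the following NumPy sketch combines the class-probability outputs of several models. It is not the authors' implementation: the function names, the example probabilities, the OWA weights, and the renormalization of the geometric mean are all assumptions made for this sketch.

```python
import numpy as np

def arithmetic_mean(probs):
    """Average the model outputs per class. probs has shape (n_models, n_classes)."""
    return probs.mean(axis=0)

def geometric_mean(probs, eps=1e-12):
    """Per-class geometric mean, renormalized to sum to 1 (renormalizing is an assumption)."""
    g = np.exp(np.log(probs + eps).mean(axis=0))
    return g / g.sum()

def owa(probs, weights):
    """Ordered Weighted Averaging: for each class, sort the model scores in
    descending order and take a weighted sum. The weights attach to rank
    positions, not to specific models, and should sum to 1."""
    weights = np.asarray(weights, dtype=float)
    ordered = -np.sort(-probs, axis=0)  # descending along the model axis
    return weights @ ordered

# Hypothetical outputs of three models over three scene classes.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.6, 0.1, 0.3],
])

avg = arithmetic_mean(probs)
geo = geometric_mean(probs)
owa_uniform = owa(probs, [1 / 3, 1 / 3, 1 / 3])  # reduces to the arithmetic mean
owa_max = owa(probs, [1.0, 0.0, 0.0])            # reduces to the per-class maximum
```

With uniform weights OWA coincides with the arithmetic mean, and with all weight on the first rank it becomes the per-class maximum, so the weight vector controls how optimistic or conservative the aggregation is.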