Sound Event Localization and Detection (SELD) is a machine listening problem whose objective is to recognize individual sound events, detect their temporal activity, and estimate their spatial location. Thanks to the emergence of more hard-labeled audio datasets, deep learning techniques have become the state-of-the-art solutions. The most common approach is a convolutional recurrent neural network (CRNN) applied after the audio signal has been transformed into a multichannel 2D representation. The squeeze-excitation technique can be regarded as a convolution enhancement that aims to learn spatial and channel feature maps independently rather than jointly, as standard convolutions do. This is usually achieved by combining global pooling operators, linear operators, and a final recalibration of the block input with its learned relationships. This work aims to improve on the accuracy of the baseline CRNN presented in DCASE 2020 Task 3 by adding residual squeeze-excitation (SE) blocks to the convolutional part of the CRNN. The procedure involves a grid search over the ratio parameter of the residual SE block (used in the linear relationships), while the remaining hyperparameters of the network are kept the same as in the baseline. Experiments show that simply introducing the residual SE blocks improves considerably on the baseline results.
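The squeeze, excitation, and recalibration steps described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the convolutional layers that would normally precede the SE operation inside a residual block are omitted, the weights are random rather than learned, and the function names (`squeeze`, `excite`, `residual_se`, `init_weights`) and the ratio values in the loop are hypothetical choices made for this example.

```python
import math
import random

def squeeze(x):
    """Global average pooling: reduce each channel (a 2D map) to one scalar."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]

def linear(weights, vec):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def excite(z, w1, w2):
    """Two linear maps with a bottleneck: ReLU, then sigmoid gates in (0, 1)."""
    hidden = [max(0.0, h) for h in linear(w1, z)]
    return [1.0 / (1.0 + math.exp(-s)) for s in linear(w2, hidden)]

def residual_se(x, w1, w2):
    """Rescale each channel by its gate, then add the block input back.

    Assumption: the convolutions of a full residual block are left out,
    so the output is simply x + gate * x for every channel.
    """
    gates = excite(squeeze(x), w1, w2)
    return [[[v + g * v for v in row] for row in ch] for ch, g in zip(x, gates)]

def init_weights(channels, ratio, seed=0):
    """Random (untrained) weights; the bottleneck width is channels // ratio."""
    rng = random.Random(seed)
    hidden = max(1, channels // ratio)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(channels)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(channels)]
    return w1, w2

# Toy input: 8 channels, each a 2x3 feature map.
x = [[[float(c + 1)] * 3 for _ in range(2)] for c in range(8)]

# Hypothetical grid over the ratio parameter; the abstract does not list the values searched.
for ratio in (1, 2, 4, 8):
    w1, w2 = init_weights(len(x), ratio)
    y = residual_se(x, w1, w2)
    print(f"ratio={ratio}: bottleneck width={len(w1)}, output channels={len(y)}")
```

The ratio only changes the width of the bottleneck between the two linear maps, so the block's output always has the same shape as its input. That is why the ratio can be searched while the rest of the network stays fixed.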