Sound event localization aims to estimate the positions of sound sources in the environment relative to an acoustic receiver (e.g., a microphone array). Recent advances in this domain have predominantly focused on deep recurrent neural networks. Inspired by the success of transformer architectures as an alternative to classical recurrent neural networks, this paper introduces a novel transformer-based sound event localization framework in which temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms. Additionally, the estimated sound event positions are represented as multivariate Gaussian variables, yielding an additional notion of uncertainty that many previously proposed deep learning-based systems designed for this application do not provide. The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy. It outperforms all competing systems on all datasets, with statistically significant differences in performance.
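The two core ideas summarized above — self-attention over frame-level audio features and a Gaussian parameterization of the estimated source positions — can be sketched in minimal form. The sketch below is purely illustrative: all names, dimensions, the single-head attention, and the diagonal-covariance negative log-likelihood are assumptions for exposition, not the paper's actual architecture or loss.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (T, d) sequence of frame-level features extracted from
    # the multi-channel audio; single-head scaled dot-product attention.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # rows sum to 1
    return A @ V                                  # (T, d) context-aware features

def gaussian_nll(mu, log_var, target):
    # Negative log-likelihood of a diagonal Gaussian (constants dropped):
    # training against this loss makes log_var a learned uncertainty estimate.
    return 0.5 * np.mean(log_var + (target - mu) ** 2 / np.exp(log_var))

# Toy forward pass with random weights (hypothetical dimensions).
rng = np.random.default_rng(0)
T, d = 8, 16                                      # frames, feature size
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
H = self_attention(X, Wq, Wk, Wv)

# Output head: per-frame 3-D position mean and log-variance.
W_mu = 0.1 * rng.standard_normal((d, 3))
W_lv = 0.1 * rng.standard_normal((d, 3))
mu, log_var = H @ W_mu, H @ W_lv
target = rng.standard_normal((T, 3))              # dummy ground-truth positions
loss = gaussian_nll(mu, log_var, target)
```

In a full system, the attention block would be stacked and trained end to end; the point of the Gaussian head is that `np.exp(log_var)` gives a per-coordinate variance, i.e., the uncertainty estimate that plain point-regression systems lack.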