Accurate sound localization in reverberant environments is essential for human auditory perception. Recently, Convolutional Neural Networks (CNNs) have been used to model the human binaural auditory pathway. However, CNNs struggle to capture global acoustic features. To address this issue, we propose a novel end-to-end Binaural Audio Spectrogram Transformer (BAST) model that predicts the sound azimuth in both anechoic and reverberant environments. We explore two implementations, BAST-SP and BAST-NSP, in which the two branches of the BAST model use shared and non-shared parameters, respectively. Our model with subtraction interaural integration and hybrid loss achieves an angular distance of 1.29 degrees and a mean squared error of 1e-3 across all azimuths, significantly outperforming a CNN-based model. An exploratory analysis of BAST's performance in the left and right hemifields and in anechoic and reverberant environments demonstrates its generalization ability and the feasibility of binaural Transformers for sound localization. Furthermore, an analysis of the attention maps provides additional insight into the localization process in a natural reverberant environment.
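The two reported metrics, angular distance and mean squared error, can be illustrated with a small sketch. The abstract does not define the hybrid loss or the output representation, so the code below assumes that an azimuth is encoded as a 2-D point on the unit circle and that the hybrid loss is a weighted sum of the MSE and the angular distance in radians. The function names, the `weight` parameter, and that encoding are illustrative assumptions, not the authors' implementation.

```python
import math


def azimuth_to_xy(azimuth_deg):
    """Map an azimuth in degrees to a point on the unit circle (assumed encoding)."""
    a = math.radians(azimuth_deg)
    return (math.cos(a), math.sin(a))


def angular_distance_deg(pred_xy, true_xy):
    """Angle in degrees between two 2-D direction vectors."""
    px, py = pred_xy
    tx, ty = true_xy
    dot = px * tx + py * ty
    norm = math.hypot(px, py) * math.hypot(tx, ty)
    if norm == 0.0:
        raise ValueError("direction vectors must be non-zero")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def mean_squared_error(pred_xy, true_xy):
    """Mean of the squared coordinate differences."""
    return sum((p - t) ** 2 for p, t in zip(pred_xy, true_xy)) / len(true_xy)


def hybrid_loss(pred_xy, true_xy, weight=1.0):
    """Assumed hybrid loss: MSE plus weighted angular distance in radians."""
    ad_rad = math.radians(angular_distance_deg(pred_xy, true_xy))
    return mean_squared_error(pred_xy, true_xy) + weight * ad_rad
```

Because the angular distance is computed from the dot product, it handles wrap-around naturally: azimuths of 350 and 10 degrees are 20 degrees apart, not 340. In a training setting the same quantities would be computed on tensors with an automatic differentiation library; plain Python is used here only to keep the sketch self-contained.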