Sound event detection (SED) and acoustic scene classification (ASC) are two widely researched audio tasks that constitute an important part of research on acoustic scene analysis. Since sound events and acoustic scenes share information, performing both tasks jointly is a natural component of a complete machine listening system. In this paper, we investigate the usefulness of several spatial audio features for training a joint deep neural network (DNN) model that performs both SED and ASC. Experiments are performed on two different datasets containing binaural recordings with synchronized sound event and acoustic scene labels, to analyse the differences between performing SED and ASC separately and jointly. The presented results show that the use of specific binaural features, mainly the Generalized Cross-Correlation with Phase Transform (GCC-phat) and the sines and cosines of inter-channel phase differences, results in a better-performing model on both the separate and joint tasks compared with baseline methods based on log-mel energies only.
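As an illustration of the binaural features named above, the following sketch computes GCC-PHAT and the sines/cosines of per-bin phase differences for a two-channel frame. This is a minimal, generic NumPy implementation, not the paper's feature-extraction pipeline; frame length, FFT size, and the toy delayed-channel signal are all assumptions for the example.

```python
import numpy as np

def gcc_phat(left, right, n_fft=512):
    """GCC-PHAT of one binaural frame: the cross-spectrum is
    whitened to unit magnitude (the 'phase transform'), so the
    inverse FFT yields a correlation curve sharply peaked at the
    lag of the right channel relative to the left."""
    X = np.fft.rfft(left, n_fft)
    Y = np.fft.rfft(right, n_fft)
    cross = np.conj(X) * Y
    cross /= np.abs(cross) + 1e-12  # keep phase only
    return np.fft.irfft(cross, n_fft)

def phase_diff_features(left, right, n_fft=512):
    """Sines and cosines of the per-bin inter-channel phase
    difference, a smooth encoding of binaural phase cues."""
    X = np.fft.rfft(left, n_fft)
    Y = np.fft.rfft(right, n_fft)
    dphi = np.angle(X) - np.angle(Y)
    return np.sin(dphi), np.cos(dphi)

# Toy frame: right channel is the left circularly delayed by 5
# samples, so the GCC-PHAT peak should land at lag 5.
rng = np.random.default_rng(0)
left = rng.standard_normal(512)
right = np.roll(left, 5)
cc = gcc_phat(left, right)
sines, cosines = phase_diff_features(left, right)
```

In practice such features are computed per STFT frame and stacked alongside log-mel energies as additional input channels to the DNN; the sin/cos encoding avoids the 2π wrap-around discontinuity of raw phase differences.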