Sound event localization and detection consists of two subtasks: sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses magnitude or phase differences between microphones to estimate source directions. Because the two subtasks depend on different cues, it is often difficult to train them jointly. We propose a novel feature called spatial cue-augmented log-spectrogram (SALSA) with exact time-frequency mapping between the signal power and the source direction-of-arrival. The feature consists of multichannel log-spectrograms stacked with the estimated direct-to-reverberant ratio and a normalized version of the principal eigenvector of the spatial covariance matrix at each time-frequency bin of the spectrograms. Experimental results on the DCASE 2021 dataset for sound event localization and detection with directional interference show that deep learning-based models trained on this new feature outperform the DCASE challenge baseline by a large margin. To further improve system performance for the DCASE sound event localization and detection challenge, we combined several models with slightly different architectures that were trained on the new feature.
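The feature construction described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the function names `stft` and `salsa_like_features`, the STFT parameters, the local time averaging controlled by `context`, the normalization of the principal eigenvector (phase taken relative to the first microphone), and the eigenvalue ratio used as a crude stand-in for the direct-to-reverberant ratio.

```python
import numpy as np


def stft(x, n_fft=256, hop=128):
    """Hann-windowed STFT. x: (channels, samples) -> (channels, frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (x.shape[1] - n_fft) // hop
    frames = np.stack(
        [x[:, i * hop: i * hop + n_fft] for i in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames * window, axis=-1)


def salsa_like_features(x, n_fft=256, hop=128, context=2, eps=1e-12):
    """Stack log-spectrograms with per-bin spatial cues (illustrative only).

    Returns a real array of shape (2 * channels, frames, bins):
      channels     log-power spectrograms,
      channels - 1 phases of the principal eigenvector relative to mic 0
                   (assumed normalization),
      1            eigenvalue ratio in dB (assumed stand-in for the DRR).
    """
    X = stft(x, n_fft, hop)                      # (C, T, F)
    T = X.shape[1]
    log_spec = np.log(np.abs(X) ** 2 + eps)      # (C, T, F)

    # Instantaneous outer products x x^H at every time-frequency bin.
    Xtf = np.transpose(X, (1, 2, 0))             # (T, F, C)
    outer = Xtf[..., :, None] * np.conj(Xtf[..., None, :])  # (T, F, C, C)

    # Spatial covariance: average the outer products over nearby frames.
    R = np.zeros_like(outer)
    for t in range(T):
        lo, hi = max(0, t - context), min(T, t + context + 1)
        R[t] = outer[lo:hi].mean(axis=0)

    # eigh returns eigenvalues in ascending order; the last column of the
    # eigenvector matrix is therefore the principal eigenvector.
    eigvals, eigvecs = np.linalg.eigh(R)
    v = eigvecs[..., :, -1]                      # (T, F, C)

    # Remove the arbitrary global phase by rotating so mic 0 is real.
    v = v * np.exp(-1j * np.angle(v[..., :1]))
    spatial = np.moveaxis(np.angle(v[..., 1:]), -1, 0)      # (C-1, T, F)

    # Dominant eigenvalue against the rest, in dB (not the paper's estimator).
    eigvals = np.maximum(eigvals, 0.0)
    ratio = (eigvals[..., -1] + eps) / (eigvals[..., :-1].sum(axis=-1) + eps)
    drr_proxy = 10.0 * np.log10(ratio)[None]     # (1, T, F)

    return np.concatenate([log_spec, spatial, drr_proxy], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.standard_normal((4, 2048))       # 4 microphones, synthetic noise
    print(salsa_like_features(audio).shape)
```

Because every output channel shares the same time-frequency grid, each spatial cue is aligned bin by bin with the signal power in the log-spectrograms, which is the property the abstract refers to as exact time-frequency mapping.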