Sound event localization and detection (SELD) comprises two subtasks: sound event detection and direction-of-arrival estimation. While sound event detection relies mainly on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and/or phase differences between microphones to estimate source directions. As a result, it is often difficult to jointly optimize these two subtasks. We propose a novel feature called Spatial cue-Augmented Log-SpectrogrAm (SALSA) with exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources. The SALSA feature consists of multichannel log-spectrograms stacked with the normalized principal eigenvector of the spatial covariance matrix at each corresponding time-frequency bin. Depending on the microphone array format, the principal eigenvector can be normalized differently to extract amplitude and/or phase differences between the microphones. Consequently, SALSA features are applicable to different microphone array formats, such as first-order ambisonics (FOA) and multichannel microphone arrays (MIC). Experimental results on the TAU-NIGENS Spatial Sound Events 2021 dataset with directional interferences showed that SALSA features outperformed other state-of-the-art features. Specifically, using SALSA features in the FOA format increased the F1 score and localization recall by 6% each, compared to multichannel log-mel spectrograms with intensity vectors. For the MIC format, using SALSA features increased the F1 score and localization recall by 16% and 7%, respectively, compared to multichannel log-mel spectrograms with generalized cross-correlation spectra.
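To make the feature construction concrete, the following is a minimal NumPy sketch of a SALSA-like feature: per-channel log-power spectrograms stacked with spatial cues taken from the principal eigenvector of the spatial covariance matrix at every time-frequency bin. It is an illustration under stated assumptions, not the authors' implementation. The function names (`stft`, `salsa_like`), the Hann window, the FFT and hop sizes, the 3-frame covariance averaging, and the choice to normalize the eigenvector by the phase of the first channel are all assumptions made here. The abstract does not specify the exact FOA or MIC normalization.

```python
import numpy as np


def stft(x, n_fft=512, hop=256):
    """Short-time Fourier transform of x with shape (channels, samples).

    Returns a complex array of shape (channels, frames, n_fft // 2 + 1).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (x.shape[1] - n_fft) // hop
    frames = np.stack(
        [x[:, i * hop:i * hop + n_fft] * window for i in range(n_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=-1)


def salsa_like(x, n_fft=512, hop=256, avg=3, eps=1e-8):
    """SALSA-like feature sketch for x with shape (M channels, samples).

    Returns a real array of shape (2 * M - 1, frames, bins):
    M log-power spectrograms followed by M - 1 phase-difference maps.
    """
    X = stft(x, n_fft, hop)                      # (M, T, F)
    n_frames = X.shape[1]
    log_spec = np.log(np.abs(X) ** 2 + eps)      # (M, T, F)

    # Per-bin outer products x x^H, shape (T, F, M, M).
    Xtf = np.transpose(X, (1, 2, 0))             # (T, F, M)
    outer = Xtf[..., :, None] * np.conj(Xtf[..., None, :])

    # Spatial covariance: average over `avg` neighboring frames
    # (assumed smoothing; edge frames are replicated).
    pad = avg // 2
    padded = np.pad(outer, ((pad, pad), (0, 0), (0, 0), (0, 0)), mode="edge")
    R = sum(padded[i:i + n_frames] for i in range(avg)) / avg

    # eigh sorts eigenvalues in ascending order, so the last column
    # is the principal eigenvector at each time-frequency bin.
    _, vecs = np.linalg.eigh(R)
    v = vecs[..., -1]                            # (T, F, M)

    # Assumed normalization: remove the arbitrary global phase by
    # referencing every channel to the first one.
    v = v * np.exp(-1j * np.angle(v[..., :1]))
    phase = np.angle(v[..., 1:])                 # (T, F, M - 1)

    spatial = np.transpose(phase, (2, 0, 1))     # (M - 1, T, F)
    return np.concatenate([log_spec, spatial], axis=0)
```

Because the spectral and spatial channels share the same time-frequency grid, a network sees the signal power and the directional cue for a given bin at the same position in the input tensor, which is the exact time-frequency mapping the abstract emphasizes. Calling `salsa_like` on a four-channel recording returns seven feature maps: four log-spectrograms and three inter-channel phase maps.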