SALSA: Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and Detection

From arXiv. © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Sound event localization and detection (SELD) consists of two subtasks, which are sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and/or phase differences between microphones to estimate source directions. As a result, it is often difficult to jointly optimize these two subtasks. We propose a novel feature called Spatial cue-Augmented Log-SpectrogrAm (SALSA) with exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources. The SALSA feature consists of multichannel log-spectrograms stacked along with the normalized principal eigenvector of the spatial covariance matrix at each corresponding time-frequency bin. Depending on the microphone array format, the principal eigenvector can be normalized differently to extract amplitude and/or phase differences between the microphones. As a result, SALSA features are applicable for different microphone array formats such as first-order ambisonics (FOA) and multichannel microphone array (MIC). Experimental results on the TAU-NIGENS Spatial Sound Events 2021 dataset with directional interferences showed that SALSA features outperformed other state-of-the-art features. Specifically, the use of SALSA features in the FOA format increased the F1 score and localization recall by 6% each, compared to the multichannel log-mel spectrograms with intensity vectors. For the MIC format, using SALSA features increased F1 score and localization recall by 16% and 7%, respectively, compared to using multichannel log-mel spectrograms with generalized cross-correlation spectra.
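The feature construction described above (per-channel log-spectrograms stacked with a normalized principal eigenvector of the spatial covariance matrix at each time-frequency bin) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `salsa_like_features`, the frame-averaging window `n_avg`, and the choice to normalize the eigenvector against channel 0 and keep only the inter-channel phase differences are assumptions made for illustration; the abstract states only that the normalization depends on the array format.

```python
import numpy as np


def salsa_like_features(stft, n_avg=3, eps=1e-12):
    """Illustrative SALSA-style feature stack (hypothetical helper).

    stft: complex array of shape (channels, frames, bins).
    Returns a real array of shape (2 * channels - 1, frames, bins):
    `channels` log-power spectrograms followed by `channels - 1`
    phase-difference maps that share the same time-frequency grid.
    """
    n_ch, n_frames, n_bins = stft.shape

    # Log-power spectrogram of every channel.
    log_spec = np.log(np.abs(stft) ** 2 + eps)

    # Outer product x x^H at each time-frequency bin: (frames, bins, ch, ch).
    x = np.moveaxis(stft, 0, -1)
    outer = x[..., :, None] * np.conj(x[..., None, :])

    # Assumption: estimate the local spatial covariance by averaging the
    # outer products over n_avg neighbouring frames (edge-padded).
    pad = n_avg // 2
    padded = np.pad(outer, ((pad, pad), (0, 0), (0, 0), (0, 0)), mode="edge")
    cov = sum(padded[i:i + n_frames] for i in range(n_avg)) / n_avg

    # eigh sorts eigenvalues in ascending order, so the last eigenvector
    # column belongs to the largest eigenvalue (the principal eigenvector).
    _, vecs = np.linalg.eigh(cov)
    principal = vecs[..., -1]  # (frames, bins, ch)

    # Assumption: remove the arbitrary global phase by referencing channel 0,
    # then keep the phase difference of each remaining channel.
    rel = principal[..., 1:] * np.conj(principal[..., :1])
    phase_diff = np.angle(rel)  # (frames, bins, ch - 1)

    # Stack spectral and spatial maps along the channel axis.
    spatial = np.moveaxis(phase_diff, -1, 0)
    return np.concatenate([log_spec, spatial], axis=0)
```

Because every output map is indexed by the same (frame, bin) grid, the spatial cue at a given bin lines up exactly with the signal power at that bin, which is the property the abstract highlights for resolving overlapping sources. An amplitude-based normalization (for example for FOA input) would replace only the last step of the sketch.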
