The performance of sound event localization and detection (SELD) in real scenes is limited by the small size of available SELD datasets, because it is difficult to obtain a sufficient amount of realistic multi-channel audio recordings with accurate labels. We used two main strategies to address the problems arising from the small real SELD dataset. First, we applied various data augmentation methods across all data dimensions: channel, frequency, and time. We also propose an original data augmentation method, named Moderate Mixup, to simulate situations in which a noise floor or interfering events are present. Second, we applied Squeeze-and-Excitation blocks along the channel and frequency dimensions to extract feature characteristics efficiently. On the STARSS22 test dataset, our trained models achieved a best error rate (ER), F1 score, localization error (LE), and localization recall (LR) of 0.53, 49.8%, 16.0°, and 56.2%, respectively.
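To illustrate the second strategy, the following is a minimal sketch of Squeeze-and-Excitation gating applied along the channel and then the frequency dimension of a feature map. It is not the authors' implementation: the `(channel, time, frequency)` layout, the sequential ordering of the two gates, the reduction ratio `r`, the function name `se_gate`, and the random weights standing in for learned parameters are all assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_gate(x, axis, w1, w2):
    """Squeeze-and-Excitation along one axis of a (channel, time, freq) array.

    w1 has shape (n // r, n) and w2 has shape (n, n // r), where
    n = x.shape[axis]. Both are hypothetical stand-ins for learned weights.
    """
    # Squeeze: average over every axis except the one being recalibrated.
    other_axes = tuple(a for a in range(x.ndim) if a != axis)
    z = x.mean(axis=other_axes)                 # shape (n,)
    # Excitation: bottleneck layer with ReLU, then sigmoid gates in (0, 1).
    h = np.maximum(w1 @ z, 0.0)                 # shape (n // r,)
    s = sigmoid(w2 @ h)                         # shape (n,)
    # Scale: broadcast one gate per index of the chosen axis.
    shape = [1] * x.ndim
    shape[axis] = x.shape[axis]
    return x * s.reshape(shape)

# Toy feature map; the sizes are assumptions chosen for the example.
rng = np.random.default_rng(0)
C, T, F, r = 8, 10, 16, 4
x = rng.standard_normal((C, T, F))

# Random weights in place of trained ones.
wc1 = rng.standard_normal((C // r, C))
wc2 = rng.standard_normal((C, C // r))
wf1 = rng.standard_normal((F // r, F))
wf2 = rng.standard_normal((F, F // r))

y = se_gate(x, 0, wc1, wc2)   # reweight channels
y = se_gate(y, 2, wf1, wf2)   # reweight frequency bins
```

Each gate lies between 0 and 1, so the block can only attenuate a channel or frequency bin relative to the others. In a trained network the weights would be learned jointly with the rest of the model.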