In this paper, we propose a novel four-stage data augmentation approach to ResNet-Conformer based acoustic modeling for sound event localization and detection (SELD). First, we explore two spatial augmentation techniques, namely audio channel swapping (ACS) and multi-channel simulation (MCS), to deal with data sparsity in SELD. ACS and MCS augment the limited training data by expanding the direction of arrival (DOA) representations, so that acoustic models trained on the augmented data are robust to variations in the locations of acoustic sources. Next, time-domain mixing (TDM) and time-frequency masking (TFM) are investigated to deal with overlapping sound events and to increase data diversity. Finally, ACS, MCS, TDM and TFM are combined step by step to form an effective four-stage data augmentation scheme. Tested on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 data sets, the proposed augmentation approach greatly improves system performance, ranking our submitted system first in the SELD task of the DCASE 2020 Challenge. Furthermore, we employ a ResNet-Conformer architecture to model both global and local context dependencies of an audio sequence, yielding further gains over the architectures used in the DCASE 2020 SELD evaluations.
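To make the last two augmentation stages concrete, the following is a minimal sketch of TDM and TFM, not the authors' implementation. It assumes that TDM is a plain additive mix of two multi-channel clips and that TFM zeroes one random block of frames and one random band of frequency bins, in the style of SpecAugment. The function names `time_domain_mix` and `time_frequency_mask`, the parameter `gain_b`, and the mask-width parameters are hypothetical. In a SELD setting, the event and DOA labels of the two mixed clips would also have to be merged, which is omitted here.

```python
import random


def time_domain_mix(clip_a, clip_b, gain_b=1.0):
    """Add two multi-channel clips sample by sample.

    Each clip is a list of channels, each channel a list of samples.
    The result is truncated to the shorter clip. (Assumed additive mix.)
    """
    if len(clip_a) != len(clip_b):
        raise ValueError("clips must have the same number of channels")
    mixed = []
    for ch_a, ch_b in zip(clip_a, clip_b):
        n = min(len(ch_a), len(ch_b))
        mixed.append([ch_a[i] + gain_b * ch_b[i] for i in range(n)])
    return mixed


def time_frequency_mask(spec, max_time_width, max_freq_width, rng=random):
    """Return a copy of spec (frames x bins) with one random block of
    frames and one random band of bins set to zero."""
    n_frames, n_bins = len(spec), len(spec[0])
    # Draw mask widths, then start positions that keep the masks in range.
    t_width = rng.randint(0, min(max_time_width, n_frames))
    f_width = rng.randint(0, min(max_freq_width, n_bins))
    t_start = rng.randint(0, n_frames - t_width)
    f_start = rng.randint(0, n_bins - f_width)
    masked = [row[:] for row in spec]  # leave the input untouched
    for t in range(n_frames):
        for f in range(n_bins):
            in_time_mask = t_start <= t < t_start + t_width
            in_freq_mask = f_start <= f < f_start + f_width
            if in_time_mask or in_freq_mask:
                masked[t][f] = 0.0
    return masked
```

In such a sketch, mixing would be applied to waveforms before feature extraction and masking to the resulting spectrogram-like features; whether the same mask is shared across channels is a design choice the abstract does not specify.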