Sound Event Localization and Detection for Real Spatial Sound Scenes: Event-Independent Network and Data Augmentation Chains

From arXiv. Submitted to the DCASE 2022 Workshop. Code is available at https://github.com/Jinbo-Hu/DCASE2022-TASK3. arXiv admin note: substantial text overlap with arXiv:2203.10228.

Polyphonic sound event localization and detection (SELD) aims to detect the types of sound events together with their temporal activities and spatial locations. In DCASE 2022 Task 3, the data change from computationally generated spatial recordings to recordings of real sound scenes. Our system submitted to DCASE 2022 Task 3 is based on our previously proposed Event-Independent Network V2 (EINV2) and a novel data augmentation method. To detect different sound events of the same type at different locations, our method employs EINV2, which combines a track-wise output format, permutation-invariant training, and soft parameter sharing. EINV2 is also extended with Conformer structures to learn local and global patterns. To improve the generalization ability of the model, we use a data augmentation method consisting of several data augmentation chains, each composed of a random combination of different data augmentation operations. To mitigate the lack of real-scene recordings in the development dataset and the imbalance among sound event classes, we use FSD50K, AudioSet, and the TAU Spatial Room Impulse Response Database (TAU-SRIR DB) to generate simulated datasets for training. We present detailed results on the validation set of Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22). Experimental results indicate that generalization to different environments and unbalanced performance among classes are the two main challenges. We evaluate our proposed method in Task 3 of the DCASE 2022 Challenge and rank second in the team ranking. Source code is released.
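
The permutation-invariant training idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mse` and `pit_loss`, the choice of mean squared error, and the exhaustive search over permutations are all assumptions made for illustration. The sketch assumes a track-wise output in which each track predicts one direction vector, and it picks the assignment of targets to tracks that gives the lowest loss, so the order in which same-class events are listed does not matter.

```python
from itertools import permutations


def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pit_loss(track_preds, track_targets):
    """Return (lowest loss, best permutation) over all track-to-target assignments.

    track_preds and track_targets hold one vector per track and must have
    the same number of tracks (an assumption of this sketch).
    best_perm[i] is the index of the target assigned to predicted track i.
    """
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(len(track_targets))):
        # Average the per-track loss under this particular assignment.
        loss = sum(mse(track_preds[i], track_targets[j]) for i, j in enumerate(perm))
        loss /= len(track_preds)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Two tracks predicting (x, y, z) direction vectors for two same-class events.
preds = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # same events, listed in swapped order
loss, perm = pit_loss(preds, targets)
print(loss, perm)  # prints: 0.0 (1, 0)
```

Exhaustive search costs factorial time in the number of tracks, which is acceptable only because a track-wise format typically uses a small, fixed number of tracks.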
