Sound event detection (SED) is an interesting but challenging task because data are scarce and real-life sound events are diverse. This paper presents a multi-grained based attention network (MGA-Net) for semi-supervised sound event detection. To obtain feature representations related to sound events, a residual hybrid convolution (RH-Conv) block is designed to boost the ability of vanilla convolution to extract time-frequency features. Moreover, a multi-grained attention (MGA) module is designed to learn temporal features at resolutions ranging from coarse to fine. With the MGA module, the network can capture the characteristics of target events of both short and long duration, and so determine the onset and offset of sound events more accurately. Furthermore, to boost the performance of the Mean Teacher (MT) method, a spatial shift (SS) module is introduced as a data perturbation mechanism to increase the diversity of the data. Experimental results show that MGA-Net outperforms the published state-of-the-art competitors, achieving event-based macro F1 (EB-F1) scores of 53.27% and 56.96% and polyphonic sound detection scores (PSDS) of 0.709 and 0.739 on the validation and public sets, respectively.
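The Mean Teacher scheme mentioned above can be illustrated with a minimal sketch. In the general Mean Teacher method, the teacher's weights are an exponential moving average (EMA) of the student's weights, and a consistency loss penalizes disagreement between student and teacher predictions on differently perturbed inputs. The function names, the `alpha` value, and the data layout below are assumptions made for illustration. In particular, `shift_frames` is a hypothetical stand-in perturbation, not the SS module of MGA-Net, whose details the abstract does not give.

```python
from typing import Dict, List


def ema_update(teacher: Dict[str, float], student: Dict[str, float],
               alpha: float = 0.999) -> None:
    """Update teacher weights in place as an EMA of the student weights."""
    for name, s_val in student.items():
        teacher[name] = alpha * teacher[name] + (1.0 - alpha) * s_val


def consistency_loss(student_probs: List[float],
                     teacher_probs: List[float]) -> float:
    """Mean squared error between student and teacher predictions."""
    if len(student_probs) != len(teacher_probs):
        raise ValueError("prediction lists must have the same length")
    return sum((s - t) ** 2
               for s, t in zip(student_probs, teacher_probs)) / len(student_probs)


def shift_frames(frames: List[List[float]], n_shift: int = 1) -> List[List[float]]:
    """Hypothetical perturbation (NOT the paper's SS module).

    `frames` is indexed as [time][channel]. The first half of the channels
    is delayed by `n_shift` frames; vacated positions are zero-filled.
    The remaining channels are left unchanged.
    """
    n_t = len(frames)
    half = len(frames[0]) // 2
    out = [row[:] for row in frames]  # copy so the input is not modified
    for t in range(n_t):
        src = t - n_shift
        for c in range(half):
            out[t][c] = frames[src][c] if 0 <= src < n_t else 0.0
    return out
```

In a training loop of this kind, the student would typically be updated by gradient descent on the supervised loss plus a weighted consistency loss, and `ema_update` would be called after each student step.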