Convolutional neural networks (CNNs) are widely used in sound event detection (SED). However, conventional convolution is limited in its ability to learn time-frequency representations of different sound events. To address this issue, we propose multi-dimensional frequency dynamic convolution (MFDConv), a new design that endows convolutional kernels with frequency-adaptive dynamic properties along multiple dimensions. MFDConv uses a novel multi-dimensional attention mechanism with a parallel strategy to learn complementary frequency-adaptive attentions, which substantially strengthen the feature extraction ability of the convolutional kernels. Moreover, to improve the performance of the mean teacher method, we propose the confident mean teacher, which increases the accuracy of the teacher's pseudo-labels and trains the student on high-confidence labels. Experimental results show that the proposed methods achieve PSDS1 and PSDS2 scores of 0.470 and 0.692, respectively, on the DESED real validation dataset.
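
The abstract does not spell out how the confident mean teacher selects its labels, so the following is only a minimal sketch of the general idea under stated assumptions: the teacher is an exponential moving average of the student (the standard mean teacher update), and the student is trained only on pseudo-labels the teacher is confident about. The function names, the `threshold` and `min_conf` values, and the masking rule are hypothetical illustrations, not the authors' method.

```python
import numpy as np


def ema_update(teacher, student, decay=0.999):
    """Standard mean teacher step: teacher weights follow an exponential
    moving average of the student weights. `decay` is an assumed value."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}


def confident_pseudo_labels(teacher_probs, threshold=0.5, min_conf=0.9):
    """Turn teacher probabilities into hard pseudo-labels plus a mask.

    Hypothetical rule: a prediction counts as confident when the teacher
    probability is far from 0.5, i.e. max(p, 1 - p) >= min_conf.
    """
    labels = (teacher_probs >= threshold).astype(np.float32)
    confidence = np.maximum(teacher_probs, 1.0 - teacher_probs)
    mask = (confidence >= min_conf).astype(np.float32)
    return labels, mask


def masked_bce(student_probs, labels, mask, eps=1e-7):
    """Binary cross-entropy averaged over confident entries only."""
    p = np.clip(student_probs, eps, 1.0 - eps)
    bce = -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))
    return float((bce * mask).sum() / max(mask.sum(), 1.0))


# Usage: one frame, three event classes.
teacher_probs = np.array([[0.95, 0.60, 0.05]])
student_probs = np.array([[0.95, 0.50, 0.05]])
labels, mask = confident_pseudo_labels(teacher_probs)
loss = masked_bce(student_probs, labels, mask)
```

In this sketch the middle class (teacher probability 0.60) is excluded from the loss, so an uncertain teacher prediction does not push the student in either direction.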