This paper presents a mixed-sample data augmentation strategy to improve model performance on acoustic scene classification, sound event classification, and speech enhancement tasks. Although several augmentation methods have proven effective for image classification, their efficacy on time-frequency domain features of audio is not assured. We propose a novel audio data augmentation approach named "Specmix", designed specifically for time-frequency domain features. The method mixes two different data samples by applying time-frequency masks that preserve the spectral correlation of each audio sample. Our experiments on acoustic scene classification, sound event classification, and speech enhancement show that Specmix improves the performance of various neural network architectures by up to 2.7%.
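The mixing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the number or width of the bands, or how targets are combined, so the following are assumptions made only for illustration: the parameters `max_bands` and `max_width`, the helper names `band_mask` and `mix_spectrograms`, and a CutMix-style label weighting by masked area. The sketch masks whole frequency rows and whole time columns, so each band keeps the spectral or temporal structure of the sample it came from.

```python
import numpy as np


def band_mask(n_freq, n_time, rng, max_bands=2, max_width=8):
    """Boolean mask that is True on random full-width frequency and time bands.

    max_bands and max_width are illustrative assumptions, not values from the paper.
    """
    mask = np.zeros((n_freq, n_time), dtype=bool)
    # Frequency bands: contiguous rows spanning the whole time axis.
    for _ in range(int(rng.integers(1, max_bands + 1))):
        width = int(rng.integers(1, min(max_width, n_freq) + 1))
        start = int(rng.integers(0, n_freq - width + 1))
        mask[start:start + width, :] = True
    # Time bands: contiguous columns spanning the whole frequency axis.
    for _ in range(int(rng.integers(1, max_bands + 1))):
        width = int(rng.integers(1, min(max_width, n_time) + 1))
        start = int(rng.integers(0, n_time - width + 1))
        mask[:, start:start + width] = True
    return mask


def mix_spectrograms(spec_a, label_a, spec_b, label_b, rng, **mask_kwargs):
    """Replace the masked bands of spec_a with the same bands of spec_b.

    Labels are weighted by the fraction of bins taken from each sample
    (an assumed, CutMix-style rule for classification targets).
    """
    mask = band_mask(spec_a.shape[0], spec_a.shape[1], rng, **mask_kwargs)
    mixed = np.where(mask, spec_b, spec_a)
    lam = 1.0 - mask.mean()  # fraction of bins kept from spec_a
    label = lam * label_a + (1.0 - lam) * label_b
    return mixed, label, mask


rng = np.random.default_rng(0)
spec_a = np.zeros((40, 100))          # stand-in for a log-mel spectrogram
spec_b = np.ones((40, 100))
label_a = np.array([1.0, 0.0])        # one-hot class labels
label_b = np.array([0.0, 1.0])
mixed, label, mask = mix_spectrograms(spec_a, label_a, spec_b, label_b, rng)
```

The label weighting applies only to classification targets. For speech enhancement the target is a signal, not a class label, and the abstract does not say how targets are combined in that case, so the sketch leaves it out.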