Transformer-based models attain excellent results and generalize well when trained on sufficient amounts of data. However, constrained by the limited data available in the audio domain, most transformer-based audio models are fine-tuned from models pre-trained in other domains (e.g., images), which exhibit a notable domain gap with audio. Other methods explore self-supervised learning directly in the audio domain but currently underperform on downstream tasks. In this paper, we present a novel self-supervised learning method for transformer-based audio models, called masked spectrogram prediction (MaskSpec), which learns powerful audio representations from unlabeled audio data (AudioSet in this paper). Our method masks random patches of the input spectrogram and reconstructs the masked regions with an encoder-decoder architecture. Without using extra model weights or supervision, experimental results on multiple downstream datasets demonstrate that MaskSpec achieves a significant performance gain over supervised methods and outperforms previous pre-trained models. In particular, our best model reaches 0.471 mAP on AudioSet, 0.854 mAP on OpenMIC2018, 0.982 accuracy on ESC-50, 0.976 accuracy on SCV2, and 0.823 accuracy on DCASE2019 Task1A.
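The random patch masking described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the patch size (16) and mask ratio (0.75) are illustrative assumptions, and the actual hyperparameters in MaskSpec may differ:

```python
import numpy as np

def random_patch_mask(spec, patch=16, mask_ratio=0.75, rng=None):
    """Split a (freq, time) spectrogram into non-overlapping square
    patches and randomly choose a subset of them to mask.

    Returns the visible patches (encoder input) plus the index sets
    of visible and masked patches (the decoder reconstructs the latter).
    Patch size and mask ratio here are illustrative assumptions.
    """
    rng = np.random.default_rng(rng)
    f, t = spec.shape
    assert f % patch == 0 and t % patch == 0, "dims must divide evenly by patch size"
    # Rearrange into a sequence of flattened patches: (num_patches, patch*patch)
    patches = (spec.reshape(f // patch, patch, t // patch, patch)
                   .transpose(0, 2, 1, 3)
                   .reshape(-1, patch * patch))
    n = patches.shape[0]
    n_mask = int(n * mask_ratio)
    perm = rng.permutation(n)
    masked_idx = np.sort(perm[:n_mask])    # regions the decoder must reconstruct
    visible_idx = np.sort(perm[n_mask:])   # patches fed to the encoder
    return patches[visible_idx], visible_idx, masked_idx
```

For a 128x64 spectrogram with these settings, the input is split into 32 patches, of which 24 are masked and 8 remain visible; the encoder only ever sees the visible subset, which is what makes the pre-training task non-trivial.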