This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. The code and models will be at https://github.com/facebookresearch/AudioMAE.
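To make the pipeline in the abstract concrete, below is a minimal PyTorch sketch of masked-autoencoder pre-training on spectrogram patches: randomly mask a high fraction of patch tokens, encode only the visible ones, pad with mask tokens, unshuffle back to the original order, decode, and compute the reconstruction loss on masked patches only. This is not the authors' code; the class name, layer sizes, and the plain global-attention decoder are illustrative assumptions (the paper additionally uses local window attention in the decoder).

```python
# Minimal sketch of Audio-MAE-style pre-training (illustrative, not the
# official implementation). Assumed names/sizes: TinyAudioMAE, dim=192,
# 4 encoder / 2 decoder layers, mask_ratio=0.8.
import torch
import torch.nn as nn


class TinyAudioMAE(nn.Module):
    def __init__(self, num_patches=512, patch_dim=256, dim=192, mask_ratio=0.8):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)
        # The paper's decoder uses local window attention; global
        # self-attention is used here for brevity.
        dec = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, num_layers=2)
        self.head = nn.Linear(dim, patch_dim)  # reconstruct patch pixels

    def forward(self, patches):                       # (B, N, patch_dim)
        B, N, _ = patches.shape
        x = self.patch_embed(patches) + self.pos_embed

        # Random masking: keep only a small subset of tokens for the encoder.
        n_keep = int(N * (1 - self.mask_ratio))
        shuffle = torch.rand(B, N, device=x.device).argsort(dim=1)
        keep_idx = shuffle[:, :n_keep]
        x_vis = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        latent = self.encoder(x_vis)

        # Pad with mask tokens and re-order back to the original layout.
        masks = self.mask_token.expand(B, N - n_keep, -1)
        full = torch.cat([latent, masks], dim=1)      # still in shuffled order
        unshuffle = shuffle.argsort(dim=1)
        full = torch.gather(full, 1, unshuffle.unsqueeze(-1).expand(-1, -1, full.size(-1)))

        pred = self.head(self.decoder(full + self.pos_embed))

        # MSE reconstruction loss, computed only on the masked patches.
        mask = torch.ones(B, N, device=x.device)
        mask[:, :n_keep] = 0                          # 1 = masked position
        mask = torch.gather(mask, 1, unshuffle)
        loss = (((pred - patches) ** 2).mean(-1) * mask).sum() / mask.sum()
        return loss


# Usage: patches would come from patchifying a log-mel spectrogram.
model = TinyAudioMAE()
spec_patches = torch.randn(2, 512, 256)
loss = model(spec_patches)
loss.backward()
```

For fine-tuning on a target dataset, one would discard the decoder, apply a lower masking ratio (or none), and attach a classification head to the encoder output, as the abstract describes.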