SSAST: Self-Supervised Audio Spectrogram Transformer

Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-art results on various audio classification benchmarks. However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining that requires a large amount of labeled data and a complex training pipeline, thus limiting the practical usage of AST. This paper focuses on audio and speech classification, and aims to alleviate the data requirement issues with the AST by leveraging self-supervised learning using unlabeled data. Specifically, we propose to pretrain the AST model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio from AudioSet and Librispeech. We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification. The proposed self-supervised framework significantly boosts AST performance on all tasks, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST. To the best of our knowledge, it is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST.
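The abstract names joint discriminative and generative masked spectrogram patch modeling but does not spell out the objectives. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the generative term is taken to be a mean squared reconstruction error on the masked patches, the discriminative term an InfoNCE-style matching loss among the masked patches, and a single hypothetical `predict_fn` stands in for the Transformer and its output heads. The patch size, mask count, and weight `lam` are illustrative choices.

```python
import numpy as np

def split_into_patches(spec, patch=16):
    """Cut a (freq, time) spectrogram into flattened, non-overlapping square patches."""
    n_f, n_t = spec.shape[0] // patch, spec.shape[1] // patch
    spec = spec[: n_f * patch, : n_t * patch]  # drop any ragged border
    tiles = spec.reshape(n_f, patch, n_t, patch).transpose(0, 2, 1, 3)
    return tiles.reshape(n_f * n_t, patch * patch)

def mspm_losses(patches, predict_fn, n_mask, rng):
    """Return (discriminative, generative) losses for one masked spectrogram.

    predict_fn is a hypothetical stand-in for the model: it maps the
    corrupted patch sequence to one predicted vector per patch.
    """
    masked_idx = rng.choice(patches.shape[0], size=n_mask, replace=False)
    corrupted = patches.copy()
    corrupted[masked_idx] = 0.0  # assumption: zeros instead of a learned mask embedding
    pred = predict_fn(corrupted)[masked_idx]
    target = patches[masked_idx]

    # Generative term (assumed): reconstruct the masked patches, scored with MSE.
    gen = float(np.mean((pred - target) ** 2))

    # Discriminative term (assumed): each prediction should pick out its own
    # target among all masked targets (InfoNCE / softmax cross-entropy).
    logits = pred @ target.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    disc = float(-np.mean(np.diag(log_probs)))
    return disc, gen

rng = np.random.default_rng(0)
spec = rng.standard_normal((128, 1024))   # synthetic stand-in for a log-mel spectrogram
patches = split_into_patches(spec)        # 8 x 64 = 512 patches of 256 values each
identity = lambda x: x                    # trivial placeholder "model"
disc, gen = mspm_losses(patches, identity, n_mask=400, rng=rng)
lam = 10.0                                # hypothetical weight between the two terms
loss = disc + lam * gen
```

With the placeholder identity model every masked prediction is a zero vector, so the discriminative term reduces to chance level, log(400). A trained model would be expected to push both terms below this baseline.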
