Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-art results on various audio classification benchmarks. However, pure Transformer models tend to require more training data than CNNs, and the success of the AST relies on supervised pretraining that requires a large amount of labeled data and a complex training pipeline, thus limiting the practical use of the AST. This paper focuses on audio and speech classification, and aims to reduce the AST's need for large amounts of labeled data by leveraging self-supervised learning on unlabeled data. Specifically, we propose to pretrain the AST model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio from AudioSet and Librispeech. We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification. The proposed self-supervised framework significantly boosts AST performance on all tasks, with an average improvement of 60.9%, leading to results similar to or even better than those of a supervised pretrained AST. To the best of our knowledge, this is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST.
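Since the abstract only names the objective, a minimal sketch may help make "joint discriminative and generative masked spectrogram patch modeling" concrete. The PyTorch code below is an illustration under assumptions, not the paper's implementation: the `MSPMHead` class, the dimensions, the masking rate, and the equal loss weighting are all hypothetical, and `encoder` is any Transformer over patch embeddings standing in for the AST.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSPMHead(nn.Module):
    """Illustrative joint discriminative + generative MSPM objective.

    `encoder` stands in for an AST-style Transformer over spectrogram
    patches; all names and dimensions here are hypothetical.
    """
    def __init__(self, encoder, embed_dim=768, patch_dim=256):
        super().__init__()
        self.encoder = encoder                             # Transformer over patch embeddings
        self.cls_head = nn.Linear(embed_dim, patch_dim)    # discriminative projection
        self.rec_head = nn.Linear(embed_dim, patch_dim)    # generative reconstruction
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, patch_emb, patches, mask):
        # patch_emb: (B, N, embed_dim) embedded spectrogram patches
        # patches:   (B, N, patch_dim) flattened raw patch values (targets)
        # mask:      (B, N) bool, True where a patch is masked out
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.expand_as(patch_emb), patch_emb)
        h = self.encoder(x)                                # contextualized representations

        h_masked = h[mask]                                 # (M, embed_dim)
        t_masked = patches[mask]                           # (M, patch_dim)

        # Discriminative loss: match each masked position to its true patch
        # among all masked patches in the batch (InfoNCE-style classification).
        logits = self.cls_head(h_masked) @ t_masked.T      # (M, M)
        labels = torch.arange(logits.size(0), device=logits.device)
        loss_d = F.cross_entropy(logits, labels)

        # Generative loss: reconstruct the raw masked patch values (MSE).
        loss_g = F.mse_loss(self.rec_head(h_masked), t_masked)

        return loss_d + loss_g                             # equal weighting assumed

# Hypothetical usage with a small Transformer standing in for the AST:
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
model = MSPMHead(encoder)
patch_emb = torch.randn(4, 100, 768)       # 4 clips, 100 patches each
patches = torch.randn(4, 100, 256)         # flattened 16x16 patch values
mask = torch.rand(4, 100) < 0.15           # mask ~15% of patches (assumed rate)
loss = model(patch_emb, patches, mask)
loss.backward()
```

The design point mirrored here is that both losses are computed only at masked positions and then summed: the discriminative term asks the encoder to identify the missing spectrogram content, while the generative term asks it to reconstruct that content, which is what "joint discriminative and generative" refers to in the abstract.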