Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by the data-hungry nature of transformers and the limited amount of labeled audio data, most transformer-based models for audio tasks are fine-tuned from ImageNet-pretrained models, despite the large domain gap between natural images and audio. This has motivated research into self-supervised pretraining of audio transformers, which reduces the dependence on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose ASiT, a novel self-supervised transformer for general audio representations that captures local and global contextual information by employing group masked model learning and self-distillation. We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification. We further conduct comprehensive ablation studies, including evaluations of different pretraining strategies. The proposed ASiT framework significantly boosts performance on all tasks and sets new state-of-the-art performance on five audio and speech classification tasks, outperforming recent methods, including approaches that use additional datasets for pretraining. The code and pretrained weights will be made publicly available for the scientific community.
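To make the masking objective concrete, the sketch below is a minimal illustration of the idea behind group masked model learning: hiding connected time-frequency blocks of spectrogram patches rather than isolated random patches. This is our own illustrative sketch, not the authors' released code; the function name `group_mask`, the block sizes `group_f`/`group_t`, and the masking ratio are all hypothetical choices.

```python
import torch

def group_mask(num_freq_patches, num_time_patches, mask_ratio=0.5,
               group_f=3, group_t=3, generator=None):
    """Sample a boolean mask over a patch grid by dropping contiguous
    time-frequency blocks until roughly mask_ratio of patches are hidden.
    Block-wise masking forces the model to use surrounding context
    instead of interpolating from immediate neighbors."""
    mask = torch.zeros(num_freq_patches, num_time_patches, dtype=torch.bool)
    target = int(mask_ratio * mask.numel())
    while mask.sum() < target:
        # Pick a random top-left corner; slicing clamps at grid edges.
        f0 = int(torch.randint(0, num_freq_patches, (1,), generator=generator))
        t0 = int(torch.randint(0, num_time_patches, (1,), generator=generator))
        mask[f0:f0 + group_f, t0:t0 + group_t] = True
    return mask.flatten()  # one flag per patch token

# Example: an 8x64 patch grid from a log-mel spectrogram (sizes hypothetical).
mask = group_mask(8, 64, mask_ratio=0.75)
```

In a self-distillation setup of the kind the abstract describes, the masked patch embeddings would typically be replaced by a shared learnable token before being fed to the student, whose outputs at those positions are trained to match targets produced by a momentum (EMA) teacher on the unmasked input.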