Representation learning from unlabeled data has been of major interest in artificial intelligence research. While self-supervised speech representation learning has been popular in the speech research community, very few works have comprehensively analyzed audio representation learning for non-speech audio tasks. In this paper, we propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks. We combine the well-known wav2vec 2.0 framework, which has shown success in self-supervised learning for speech tasks, with parameter-efficient conformer architectures. Our self-supervised pre-training can reduce the need for labeled data by two-thirds. On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, a new state of the art on this dataset via audio-only self-supervised learning. Our fine-tuned conformers also surpass or match the performance of previous systems pre-trained in a supervised way on several downstream tasks. We further discuss important design considerations for both pre-training and fine-tuning.