BEATs: Audio Pre-Training with Acoustic Tokenizers

Self-supervised learning (SSL) has grown rapidly in the language, vision, speech, and audio domains over the past few years. While discrete label prediction is widely adopted for other modalities, state-of-the-art audio SSL models still employ a reconstruction loss for pre-training. Compared with a reconstruction loss, semantic-rich discrete label prediction encourages the SSL model to abstract high-level audio semantics and discard redundant details, as in human perception. However, a semantic-rich acoustic tokenizer for general audio pre-training is usually not straightforward to obtain, because audio is continuous and, unlike speech, has no phoneme sequences available. To tackle this challenge, we propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representation from Audio Transformers, in which an acoustic tokenizer and an audio SSL model are optimized iteratively. In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask-and-label-prediction manner. Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model. The iteration is repeated in the hope that the acoustic tokenizer and the audio SSL model improve each other. The experimental results demonstrate that our acoustic tokenizers can generate discrete labels with rich audio semantics and that our audio SSL models achieve state-of-the-art results across various audio classification benchmarks, even significantly outperforming previous models that use more training data and model parameters. Specifically, we set a new state of the art of 50.6% mAP on AudioSet-2M for audio-only models without using any external data, and 98.1% accuracy on ESC-50. The code and pre-trained models are available at https://aka.ms/beats.
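To make the first iteration more concrete, the following is a minimal sketch of how a random-projection tokenizer could turn continuous audio patches into discrete labels for masked prediction. It is an illustration under stated assumptions, not the released implementation: the abstract says only that random projection is used, so the nearest-neighbour lookup in a frozen random codebook, the function names (make_random_tokenizer, tokenize, choose_masked_positions), the dimensions, and the mask ratio are all hypothetical choices made here for clarity.

```python
import random


def make_random_tokenizer(patch_dim, proj_dim, codebook_size, seed=0):
    """Build a frozen random projection matrix and a frozen random codebook.

    Assumption: neither is ever trained; both are fixed by the seed.
    """
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(patch_dim)]
                  for _ in range(proj_dim)]
    codebook = [[rng.gauss(0.0, 1.0) for _ in range(proj_dim)]
                for _ in range(codebook_size)]
    return projection, codebook


def tokenize(patches, projection, codebook):
    """Map each patch to the index of its nearest codebook vector."""
    labels = []
    for patch in patches:
        # Linear random projection of the patch features.
        projected = [sum(w * x for w, x in zip(row, patch))
                     for row in projection]
        # Squared Euclidean distance to every codebook entry.
        distances = [sum((p - c) ** 2 for p, c in zip(projected, code))
                     for code in codebook]
        labels.append(distances.index(min(distances)))
    return labels


def choose_masked_positions(num_patches, mask_ratio, rng):
    """Pick the patch positions whose labels a model would have to predict."""
    num_masked = max(1, int(round(num_patches * mask_ratio)))
    return sorted(rng.sample(range(num_patches), num_masked))


# Toy usage: 10 patches of 4 features each (stand-ins for spectrogram patches).
data_rng = random.Random(123)
toy_patches = [[data_rng.random() for _ in range(4)] for _ in range(10)]

toy_projection, toy_codebook = make_random_tokenizer(
    patch_dim=4, proj_dim=3, codebook_size=8, seed=0)
toy_labels = tokenize(toy_patches, toy_projection, toy_codebook)
toy_masked = choose_masked_positions(len(toy_patches), 0.5, random.Random(7))

# Prediction targets: the discrete labels at the masked positions only.
toy_targets = {pos: toy_labels[pos] for pos in toy_masked}
print(toy_targets)
```

In this sketch the SSL model itself is omitted; it would see only the unmasked patches and be trained to predict the labels in toy_targets. In later iterations, as the abstract describes, the random tokenizer would be replaced by one distilled from the pre-trained or fine-tuned SSL model.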
