The lack of annotated training data in bioacoustics hinders the use of large-scale neural network models trained in a supervised way. To leverage the large amount of unannotated audio data, we propose AVES (Animal Vocalization Encoder based on Self-Supervision), a self-supervised, transformer-based audio representation model for encoding animal vocalizations. We pretrain AVES on a diverse set of unannotated audio datasets and fine-tune it for downstream bioacoustics tasks. Comprehensive experiments on a suite of classification and detection tasks show that AVES outperforms all strong baselines and even the supervised "topline" models trained on annotated audio classification datasets. The results also suggest that curating a small training subset related to downstream tasks is an efficient way to train high-quality audio representation models. We open-source our models at \url{https://github.com/earthspecies/aves}.
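To make the open-sourced models concrete, below is a minimal sketch of how one might load a released AVES checkpoint and extract frame-level embeddings for a downstream classifier. The file names (\texttt{aves\_config.json}, \texttt{aves\_checkpoint.pt}, \texttt{vocalization.wav}) are hypothetical placeholders, and the sketch assumes the HuBERT-style weights can be loaded through torchaudio's \texttt{wav2vec2\_model} builder, as described in the repository; consult the repository for the exact loading procedure.

\begin{verbatim}
# Minimal sketch: load an AVES checkpoint and extract embeddings.
# Paths are hypothetical; assumes the released weights are compatible
# with torchaudio's generic wav2vec2/HuBERT model builder.
import json

import torch
import torchaudio
from torchaudio.models import wav2vec2_model

# Model hyperparameters shipped alongside the checkpoint.
with open("aves_config.json") as f:
    config = json.load(f)

model = wav2vec2_model(**config, aux_num_out=None)
model.load_state_dict(torch.load("aves_checkpoint.pt"))
model.eval()

# The encoder expects 16 kHz mono audio.
waveform, sr = torchaudio.load("vocalization.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000)

with torch.inference_mode():
    # extract_features returns per-layer features; the last layer
    # gives (batch, frames, dim) embeddings that can feed a
    # downstream classification or detection head.
    layer_features, _ = model.extract_features(waveform)
    embeddings = layer_features[-1]
\end{verbatim}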