We introduce DECAR, a self-supervised pre-training approach for learning general-purpose audio representations. Our system is based on clustering: it uses an offline clustering step to produce target labels that act as pseudo-labels for a prediction task. Building on recent advances in self-supervised learning for computer vision, we design a lightweight, easy-to-use self-supervised pre-training scheme. We pre-train DECAR embeddings on a balanced subset of the large-scale AudioSet dataset and transfer those representations to 9 downstream classification tasks spanning speech, music, animal sounds, and acoustic scenes. Furthermore, we conduct ablation studies to identify key design choices, and we make all our code and pre-trained models publicly available.
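As a rough illustration of the clustering-then-predict scheme the abstract describes, the sketch below alternates an offline k-means step, whose cluster assignments serve as pseudo-labels, with supervised training of the encoder on those labels. The encoder architecture, input shape, cluster count, and optimizer settings here are all placeholder assumptions for illustration, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

NUM_CLUSTERS = 16  # illustrative; the real number of pseudo-classes is a design choice

# Hypothetical encoder: stands in for whatever network maps
# fixed-size log-mel patches (here 64 x 96) to embeddings.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 96, 512), nn.ReLU())
head = nn.Linear(512, NUM_CLUSTERS)  # prediction head over pseudo-labels
params = list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def pretrain_epoch(batches):
    # Offline clustering step: embed the whole set with the current
    # encoder, cluster, and reuse cluster assignments as pseudo-labels.
    encoder.eval()
    with torch.no_grad():
        feats = torch.cat([encoder(x) for x in batches])
    pseudo = torch.as_tensor(
        KMeans(n_clusters=NUM_CLUSTERS, n_init=10).fit_predict(feats.numpy())
    )
    # Prediction task: train encoder + head to classify each clip
    # into its assigned cluster.
    encoder.train()
    start = 0
    for x in batches:
        y = pseudo[start:start + len(x)]
        start += len(x)
        loss = loss_fn(head(encoder(x)), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Toy run on random "log-mel" batches of shape (batch, 64, 96).
batches = [torch.randn(8, 64, 96) for _ in range(8)]
pretrain_epoch(batches)
```

In schemes of this kind, re-running the clustering step between epochs refreshes the pseudo-labels as the embeddings improve, which is what distinguishes an offline clustering approach from training against a fixed label set.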