Deep learning is data hungry, and supervised learning in particular requires large amounts of labeled data to work well. Machine listening research often suffers from a shortage of labeled data, because human annotations are costly to acquire and annotating audio is time consuming and less intuitive than annotating other modalities. Moreover, models learned from a labeled dataset often embed biases specific to that dataset. Unsupervised learning techniques have therefore become popular approaches to machine listening problems. In particular, a self-supervised learning technique that reconstructs multiple hand-crafted audio features has shown promising results in speech-domain tasks such as emotion recognition and automatic speech recognition (ASR). In this paper, we apply self-supervised and multi-task learning methods to pre-train music encoders, and explore various design choices, including encoder architectures, weighting mechanisms for combining losses from multiple tasks, and the selection of workers for the pretext tasks. We investigate how these design choices interact with various downstream music classification tasks. We find that using a variety of music-specific workers together with weighting mechanisms that balance the losses during pre-training helps improve performance on, and generalization to, the downstream tasks.
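To make the idea of balancing losses across pretext-task workers concrete, the following is a minimal sketch and not the weighting mechanism used in the paper, which the abstract does not specify. It assumes one simple scheme, weights inversely proportional to each worker's current loss, and the worker names (`mfcc`, `chroma`, `tempogram`) and loss values are hypothetical.

```python
from typing import Dict


def combine_losses(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted sum of the per-worker losses."""
    return sum(weights[name] * value for name, value in losses.items())


def inverse_magnitude_weights(
    losses: Dict[str, float], eps: float = 1e-8
) -> Dict[str, float]:
    """Assumed scheme: weights proportional to 1 / loss, normalized to sum to 1.

    Each worker then contributes roughly the same amount to the combined
    loss, so a worker with a large-scale loss does not dominate training.
    """
    inverse = {name: 1.0 / (value + eps) for name, value in losses.items()}
    total = sum(inverse.values())
    return {name: inv / total for name, inv in inverse.items()}


# Hypothetical per-worker reconstruction losses on very different scales.
losses = {"mfcc": 4.0, "chroma": 1.0, "tempogram": 0.25}

weights = inverse_magnitude_weights(losses)
combined = combine_losses(losses, weights)

# Per-worker contribution to the combined loss after weighting.
contributions = {name: weights[name] * losses[name] for name in losses}
print(weights)
print(contributions)
print(combined)
```

In a real training loop the weights would typically be recomputed from recent loss values and treated as constants during backpropagation; other schemes, such as learned weights, would fit the same `combine_losses` interface.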