Music annotation has always been one of the critical topics in the field of Music Information Retrieval (MIR). Traditional models use supervised learning for music annotation tasks. However, as supervised machine learning approaches grow in complexity, their increasing demand for annotated training data often cannot be met by the data available. In this paper, a new self-supervised music acoustic representation learning approach named MusiCoder is proposed. Inspired by the success of BERT, MusiCoder builds upon the architecture of self-attention bidirectional transformers. Two pre-training objectives, Contiguous Frames Masking (CFM) and Contiguous Channels Masking (CCM), are designed to adapt BERT-like masked reconstruction pre-training to the continuous acoustic frame domain. The performance of MusiCoder is evaluated on two downstream music annotation tasks. The results show that MusiCoder outperforms state-of-the-art models on both music genre classification and auto-tagging. The effectiveness of MusiCoder indicates the great potential of a new self-supervised learning approach to understanding music: first, apply masked reconstruction tasks to pre-train a transformer-based model on massive unlabeled music acoustic data, and then fine-tune the model on specific downstream tasks with labeled data.
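To make the two masking objectives named above concrete, the following is a minimal illustrative sketch of how CFM and CCM could be applied to an acoustic feature matrix before reconstruction pre-training. All function names, span lengths, and other parameters here are assumptions for illustration, not the paper's actual configuration.

```python
# Hypothetical sketch of Contiguous Frames Masking (CFM) and
# Contiguous Channels Masking (CCM) on a log-mel spectrogram.
# Parameter values are illustrative assumptions only.
import numpy as np

def apply_cfm_ccm(spectrogram, frame_span=5, num_frame_spans=2,
                  channel_span=8, num_channel_spans=1, seed=None):
    """Mask contiguous time frames (CFM) and contiguous channels (CCM).

    spectrogram: float array of shape (num_frames, num_channels),
                 e.g. a log-mel spectrogram of one music clip.
    Returns the masked copy and a boolean mask marking the positions
    the model would be asked to reconstruct.
    """
    rng = np.random.default_rng(seed)
    masked = spectrogram.copy()
    mask = np.zeros_like(spectrogram, dtype=bool)
    num_frames, num_channels = spectrogram.shape

    # CFM: mask a few contiguous spans along the time (frame) axis.
    for _ in range(num_frame_spans):
        start = rng.integers(0, max(1, num_frames - frame_span))
        mask[start:start + frame_span, :] = True

    # CCM: mask a contiguous band along the channel (frequency) axis.
    for _ in range(num_channel_spans):
        start = rng.integers(0, max(1, num_channels - channel_span))
        mask[:, start:start + channel_span] = True

    masked[mask] = 0.0
    return masked, mask

# Usage: a transformer encoder would be pre-trained to reconstruct
# spectrogram[mask] from masked_input, analogous to BERT's masked
# language modelling but over continuous acoustic frames.
example = np.random.rand(400, 80).astype(np.float32)  # 400 frames, 80 mel bins
masked_input, target_mask = apply_cfm_ccm(example, seed=0)
```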