Music structure analysis (MSA) methods traditionally search for musically meaningful patterns in audio: homogeneity, repetition, novelty, and segment-length regularity. Hand-crafted audio features such as MFCCs or chromagrams are often used to elicit these patterns. However, as more annotations of section labels (e.g., verse, chorus, and bridge) become available, one can use supervised feature learning to make these patterns even clearer and improve MSA performance. To this end, we take a supervised metric learning approach: we train a deep neural network that maps spectrogram inputs to embeddings that are close to each other when the inputs have the same section type (according to an annotation) and far apart otherwise. We propose a batch sampling scheme to ensure that the labels in a training pair are interpreted meaningfully. The trained model extracts features that can be used in existing MSA algorithms. In evaluations on three datasets (HarmonixSet, SALAMI, and RWC), we demonstrate that the proposed features significantly improve a traditional MSA algorithm in both intra- and cross-dataset scenarios.
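To make the metric learning objective concrete, the following is a minimal sketch of a triplet-style hinge loss over embeddings: it pulls together two embeddings whose spectrogram patches share a section label and pushes apart embeddings from different sections. The function name, the margin value, and the toy 4-dimensional embeddings are illustrative assumptions, not the paper's exact loss or architecture.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Hinge loss: distance to the same-section embedding should be
    # smaller than distance to the different-section embedding by
    # at least `margin` (hypothetical sketch, not the paper's exact loss).
    d_pos = np.linalg.norm(anchor - positive)   # same section type
    d_neg = np.linalg.norm(anchor - negative)   # different section type
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings for three spectrogram patches:
a = np.array([0.0, 1.0, 0.0, 0.0])   # anchor (e.g., a "chorus" patch)
p = np.array([0.1, 0.9, 0.0, 0.0])   # same section type -> should be near
n = np.array([1.0, 0.0, 0.0, 0.0])   # different section type -> should be far

loss = triplet_loss(a, p, n)
```

With these toy vectors the margin constraint is already satisfied, so the loss is zero; swapping the positive and negative inputs yields a positive loss, which is the gradient signal that would reshape the embedding space during training.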