Conventional music structure analysis algorithms aim to divide a song into segments and to group them with abstract labels (e.g., 'A', 'B', and 'C'). However, explicitly identifying the function of each segment (e.g., 'verse' or 'chorus') is rarely attempted, despite its many applications. We introduce a multi-task deep learning framework that models these structural semantic labels directly from audio by estimating "verseness," "chorusness," and so forth, as a function of time. We propose a 7-class taxonomy (i.e., intro, verse, chorus, bridge, outro, instrumental, and silence) and provide rules to consolidate annotations from four disparate datasets. We also propose to use a spectral-temporal Transformer-based model, called SpecTNT, which can be trained with an additional connectionist temporal localization (CTL) loss. In cross-dataset evaluations using four public datasets, we demonstrate the effectiveness of the SpecTNT model and the CTL loss, and obtain strong results overall: the proposed system outperforms state-of-the-art chorus-detection and boundary-detection methods at detecting choruses and boundaries, respectively.
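To make the framing concrete, the sketch below illustrates what frame-level structural-function estimates over the 7-class taxonomy might look like: per-frame activation curves (e.g., "chorusness") decoded into labeled segments. The class list follows the taxonomy above, but the frame rate, the greedy argmax decoding, and the random stand-in logits are illustrative assumptions, not the paper's actual SpecTNT model or post-processing.

```python
import numpy as np

# The paper's 7-class taxonomy of structural functions.
CLASSES = ["intro", "verse", "chorus", "bridge", "outro", "instrumental", "silence"]

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def decode_structure(frame_logits, frame_rate=10.0):
    """Turn per-frame class logits of shape (T, 7) into labeled segments.

    `frame_logits` stands in for the output of a model such as SpecTNT;
    the greedy run-length decoding here is only an illustration, not the
    paper's actual inference procedure.
    """
    probs = softmax(frame_logits, axis=-1)   # (T, 7) activation curves,
                                             # e.g. probs[:, 2] is "chorusness"
    labels = probs.argmax(axis=-1)           # most likely function per frame
    segments, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((start / frame_rate, t / frame_rate, CLASSES[labels[start]]))
            start = t
    return probs, segments

# Usage with random logits in place of real model output (hypothetical).
rng = np.random.default_rng(0)
_, segs = decode_structure(rng.normal(size=(600, len(CLASSES))))
for s, e, name in segs[:5]:
    print(f"{s:6.1f}s - {e:6.1f}s  {name}")
```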