Version identification (VI) has seen substantial progress over the past few years. On the one hand, the introduction of the metric learning paradigm has favored the emergence of scalable yet accurate VI systems. On the other hand, using features that focus on specific aspects of musical pieces, such as melody, harmony, or lyrics, has yielded interpretable and promising performance. In this work, we build upon these recent advances and propose a metric learning-based system that systematically leverages four dimensions widely acknowledged to convey musical similarity between versions: melodic line, harmonic structure, rhythmic patterns, and lyrics. We describe our deliberately simple model architecture, and we show in particular that an approximate representation of the lyrics is an efficient proxy for discriminating between versions and non-versions. We then describe how these features complement each other and yield new state-of-the-art performance on two publicly available datasets. We finally suggest that a VI system combining melodic, harmonic, rhythmic, and lyrics features could theoretically reach the optimal performance obtainable on these datasets.
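The abstract does not specify how the four feature dimensions are fused, so the following is only an illustrative sketch, not the authors' architecture: it assumes each version is represented by one embedding per dimension (as metric learning systems typically produce) and that a candidate is scored by a weighted average of per-dimension cosine similarities. All function and parameter names (version_score, weights) are hypothetical.

```python
import numpy as np

DIMS = ["melody", "harmony", "rhythm", "lyrics"]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def version_score(query: dict, candidate: dict, weights: dict) -> float:
    """Hypothetical fusion: weighted average of per-dimension similarities."""
    total_weight = sum(weights[d] for d in DIMS)
    return sum(weights[d] * cosine_similarity(query[d], candidate[d]) for d in DIMS) / total_weight

# Illustrative usage with random 128-dimensional embeddings.
rng = np.random.default_rng(0)
query = {d: rng.normal(size=128) for d in DIMS}
candidate = {d: rng.normal(size=128) for d in DIMS}
weights = {d: 1.0 for d in DIMS}
print(version_score(query, candidate, weights))
```

Under this assumed scheme, a high fused score would flag a candidate as a likely version of the query; the relative weights could be tuned to reflect how much each dimension contributes to version similarity.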