Extracting pitch information from music recordings is a challenging but important problem in music signal processing. Frame-wise transcription or multi-pitch estimation aims to detect the simultaneous activity of pitches in polyphonic music recordings and has recently seen major improvements thanks to deep-learning techniques, with a variety of proposed network architectures. In this paper, we realize different architectures based on CNNs, the U-net structure, and self-attention components. We propose several modifications to these architectures, including self-attention modules for skip connections, recurrent layers to replace the self-attention, and a multi-task strategy with simultaneous prediction of the degree of polyphony. We compare variants of these architectures of different sizes for multi-pitch estimation, focusing on Western classical music beyond the piano-solo scenario using the MusicNet and Schubert Winterreise datasets. Our experiments indicate that most architectures yield competitive results and that larger model variants seem to be beneficial. However, we find that these results substantially depend on randomization effects and the particular choice of the training-test split, which calls into question claims of superiority for particular architectures based on only small improvements. We therefore investigate the influence of dataset splits in the presence of several movements of a work cycle (cross-version evaluation) and propose a best-practice splitting strategy for MusicNet, which weakens the influence of individual test tracks and suppresses overfitting to specific works and recording conditions. A final evaluation on a mixed dataset suggests that improvements on one specific dataset do not necessarily generalize to other scenarios, thus emphasizing the need for further high-quality multi-pitch datasets in order to reliably measure progress in music transcription tasks.
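To make the described architecture variants more concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' implementation; all layer sizes, module names, and the 72-pitch output range are illustrative assumptions). It combines a tiny U-net-style CNN over a time-frequency input with a self-attention module applied to a skip connection and a multi-task head that additionally predicts the degree of polyphony.

```python
# Illustrative sketch only: a small U-net-style CNN with a self-attention skip
# connection and a multi-task head (frame-wise pitch activity + polyphony degree).
import torch
import torch.nn as nn


class AttentionSkip(nn.Module):
    """Self-attention over the time axis of a skip-connection feature map (assumed design)."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):  # x: (batch, channels, freq, time)
        b, c, f, t = x.shape
        seq = x.permute(0, 2, 3, 1).reshape(b * f, t, c)  # treat time frames as tokens
        out, _ = self.attn(seq, seq, seq)
        return out.reshape(b, f, t, c).permute(0, 3, 1, 2)


class MultiPitchUNet(nn.Module):
    """Tiny one-level U-net; outputs pitch activations and an excerpt-level polyphony estimate."""

    def __init__(self, num_pitches: int = 72):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.skip_attn = AttentionSkip(16)
        self.dec = nn.Sequential(nn.Conv2d(32 + 16, 16, 3, padding=1), nn.ReLU())
        self.pitch_head = nn.Conv2d(16, 1, 1)  # frame-wise pitch activations
        self.poly_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))

    def forward(self, spec):  # spec: (batch, 1, freq_bins, time_frames)
        e = self.enc(spec)
        b = self.bottleneck(self.down(e))
        d = self.dec(torch.cat([self.up(b), self.skip_attn(e)], dim=1))
        pitch_logits = self.pitch_head(d).squeeze(1)  # (batch, freq_bins, time_frames)
        polyphony = self.poly_head(d)                 # (batch, 1), degree of polyphony per excerpt
        return pitch_logits, polyphony


if __name__ == "__main__":
    model = MultiPitchUNet()
    dummy = torch.randn(2, 1, 72, 100)  # batch of 2 spectrogram excerpts
    pitches, poly = model(dummy)
    print(pitches.shape, poly.shape)    # torch.Size([2, 72, 100]) torch.Size([2, 1])
```

In such a multi-task setup, the pitch logits would typically be trained with a frame-wise binary loss and the polyphony output with a regression loss, reflecting the simultaneous prediction of the degree of polyphony mentioned above.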