Extracting pitch information from music recordings is a challenging but important problem in music signal processing. Frame-wise transcription or multi-pitch estimation aims to detect the simultaneous activity of pitches in polyphonic music recordings and has recently seen major improvements thanks to deep-learning techniques, with a variety of proposed network architectures. In this paper, we realize different architectures based on CNNs, the U-net structure, and self-attention components. We propose several modifications to these architectures, including self-attention modules for skip connections, recurrent layers to replace the self-attention, and a multi-task strategy with simultaneous prediction of the degree of polyphony. We compare variants of these architectures of different sizes for multi-pitch estimation, focusing on Western classical music beyond the piano-solo scenario using the MusicNet and Schubert Winterreise datasets. Our experiments indicate that most architectures yield competitive results and that larger model variants seem to be beneficial. However, we find that these results substantially depend on randomization effects and the particular choice of the training-test split, which calls into question claims of superiority for particular architectures based on only small improvements. We therefore investigate the influence of dataset splits in the presence of several movements of a work cycle (cross-version evaluation) and propose a best-practice splitting strategy for MusicNet, which weakens the influence of individual test tracks and suppresses overfitting to specific works and recording conditions. A final evaluation on a mixed dataset suggests that improvements on one specific dataset do not necessarily generalize to other scenarios, thus emphasizing the need for further high-quality multi-pitch datasets in order to reliably measure progress in music transcription tasks.
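To make the described architecture variants more concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' implementation; all layer sizes, module names, and the 72-pitch output range are illustrative assumptions). It combines a tiny U-net-style CNN over a time-frequency input with a self-attention module applied to a skip connection and a multi-task head that additionally predicts the degree of polyphony.

```python
# Illustrative sketch only: a small U-net-style CNN with a self-attention skip
# connection and a multi-task head (frame-wise pitch activity + polyphony degree).
import torch
import torch.nn as nn


class AttentionSkip(nn.Module):
    """Self-attention over the time axis of a skip-connection feature map (assumed design)."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):  # x: (batch, channels, freq, time)
        b, c, f, t = x.shape
        seq = x.permute(0, 2, 3, 1).reshape(b * f, t, c)  # treat time frames as tokens
        out, _ = self.attn(seq, seq, seq)
        return out.reshape(b, f, t, c).permute(0, 3, 1, 2)


class MultiPitchUNet(nn.Module):
    """Tiny one-level U-net; outputs pitch activations and an excerpt-level polyphony estimate."""

    def __init__(self, num_pitches: int = 72):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.skip_attn = AttentionSkip(16)
        self.dec = nn.Sequential(nn.Conv2d(32 + 16, 16, 3, padding=1), nn.ReLU())
        self.pitch_head = nn.Conv2d(16, 1, 1)  # frame-wise pitch activations
        self.poly_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))

    def forward(self, spec):  # spec: (batch, 1, freq_bins, time_frames)
        e = self.enc(spec)
        b = self.bottleneck(self.down(e))
        d = self.dec(torch.cat([self.up(b), self.skip_attn(e)], dim=1))
        pitch_logits = self.pitch_head(d).squeeze(1)  # (batch, freq_bins, time_frames)
        polyphony = self.poly_head(d)                 # (batch, 1), degree of polyphony per excerpt
        return pitch_logits, polyphony


if __name__ == "__main__":
    model = MultiPitchUNet()
    dummy = torch.randn(2, 1, 72, 100)  # batch of 2 spectrogram excerpts
    pitches, poly = model(dummy)
    print(pitches.shape, poly.shape)    # torch.Size([2, 72, 100]) torch.Size([2, 1])
```

In such a multi-task setup, the pitch logits would typically be trained with a frame-wise binary loss and the polyphony output with a regression loss, reflecting the simultaneous prediction of the degree of polyphony mentioned above.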