Automatic music transcription (AMT) is the task of transcribing audio recordings into symbolic representations. Recently, neural-network-based methods have been applied to AMT and have achieved state-of-the-art results. However, many previous systems only detect the onsets and offsets of notes frame-wise, so the transcription resolution is limited to the frame hop size. There is also a lack of research on different strategies for encoding onset and offset targets for training. In addition, previous AMT systems are sensitive to misaligned onset and offset labels in audio recordings, and there is limited research on sustain-pedal transcription on large-scale datasets. In this article, we propose a high-resolution AMT system trained by regressing the precise onset and offset times of piano notes. At inference, we propose an algorithm that analytically calculates the precise onset and offset times of piano notes and pedal events. We show that our AMT system is more robust to misaligned onset and offset labels than previous systems. Our proposed system achieves an onset F1 of 96.72% on the MAESTRO dataset, outperforming the 94.80% of the previous onsets-and-frames system. It also achieves a pedal onset F1 of 91.86%, which is the first benchmark result for pedal transcription on the MAESTRO dataset. We have released the source code and checkpoints of our work at https://github.com/bytedance/piano_transcription.
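To make the idea of regressing and then analytically decoding precise onset times concrete, below is a minimal sketch, not the exact algorithm from the paper or the released code: it encodes an onset as a triangular frame-wise regression target of half-width J frames and recovers a sub-frame onset estimate from the outputs at the two frames adjacent to a detected peak. The names `frames_per_second`, `J`, and the helper functions are hypothetical choices for this illustration.

```python
import numpy as np

# Illustrative sketch only (assumed parameters, not the paper's exact method).
frames_per_second = 100   # assumed frame hop of 10 ms
J = 5                     # assumed half-width of the triangular target, in frames


def encode_onset_target(onset_time, num_frames):
    """Frame-wise regression target: 1.0 at the precise onset time,
    decaying linearly to 0.0 over J frames on either side."""
    frame_times = np.arange(num_frames) / frames_per_second
    distance_in_frames = np.abs(frame_times - onset_time) * frames_per_second
    return np.clip(1.0 - distance_in_frames / J, 0.0, None)


def decode_precise_onset(output, peak_frame):
    """Given frame-wise outputs with a detected local maximum at `peak_frame`,
    estimate a sub-frame onset time. For a triangular target of half-width J,
    the neighbouring values A and C satisfy C - A = 2 * delta / J, so the
    sub-frame shift is delta = J * (C - A) / 2 (in frames)."""
    A = output[peak_frame - 1]
    C = output[peak_frame + 1]
    delta = J * (C - A) / 2.0
    return (peak_frame + delta) / frames_per_second


# Example: an onset 3.4 ms after the centre of frame 123 is recovered
# with sub-frame precision from the frame-wise target.
true_onset = 123 / frames_per_second + 0.0034
target = encode_onset_target(true_onset, num_frames=200)
estimated_onset = decode_precise_onset(target, peak_frame=123)
print(true_onset, estimated_onset)
```

The point of the sketch is that a regression target carries the sub-frame timing information that a binary frame-wise target discards, which is what allows the transcription resolution to exceed the frame hop size.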