Most current supervised automatic music transcription (AMT) models lack the ability to generalize: they have trouble transcribing real-world music recordings from diverse musical genres that are not present in the labelled training data. In this paper, we propose a semi-supervised framework, ReconVAT, which addresses this issue by leveraging the large amount of unlabelled music recordings available. The proposed ReconVAT uses a reconstruction loss together with virtual adversarial training. When combined with existing U-net models for AMT, ReconVAT achieves competitive results on common benchmark datasets such as MAPS and MusicNet. For example, in the few-shot setting for the string-part version of MusicNet, ReconVAT achieves F1-scores of 61.0% and 41.6% on the note-wise and note-with-offset-wise metrics respectively, a relative improvement of 22.2% and 62.5% over the supervised baseline model. Our proposed framework also demonstrates the potential of continual learning on new data, which could be useful in real-world applications in which new data is constantly available.
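The abstract names two unsupervised training signals: a reconstruction loss and virtual adversarial training (VAT). As a rough illustration of the VAT component only, the sketch below (PyTorch; not the authors' implementation) approximates the adversarial perturbation with a single step of power iteration. The assumption that the model outputs a frame-wise note posteriorgram in [0, 1], the use of binary cross-entropy as the divergence, and the hyperparameter names `xi` and `eps` are all illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def _l2_normalize(d):
    # Scale each sample's perturbation to unit L2 norm.
    norms = d.flatten(1).norm(dim=1).view(-1, *([1] * (d.dim() - 1)))
    return d / (norms + 1e-8)

def vat_loss(model, x, xi=1e-6, eps=2.0, n_power=1):
    """Virtual adversarial loss on a batch of unlabelled spectrograms x.

    Illustrative sketch only: assumes `model` maps a spectrogram to a
    frame-wise note posteriorgram in [0, 1] (e.g. sigmoid outputs), and
    uses binary cross-entropy as the divergence; the paper's exact
    divergence and hyperparameters may differ.
    """
    with torch.no_grad():
        p = model(x)  # clean prediction, held fixed as the target

    # Random initial direction, refined by power iteration.
    d = _l2_normalize(torch.randn_like(x))
    for _ in range(n_power):
        d.requires_grad_(True)
        p_hat = model(x + xi * d)
        adv_div = F.binary_cross_entropy(p_hat, p)
        grad = torch.autograd.grad(adv_div, d)[0]
        d = _l2_normalize(grad.detach())

    # Final adversarial perturbation and smoothness penalty.
    r_adv = eps * d
    p_hat = model(x + r_adv)
    return F.binary_cross_entropy(p_hat, p)
```

In a semi-supervised setup such as the one the abstract describes, a term like this would be added on unlabelled batches alongside the supervised transcription loss and the reconstruction loss, encouraging predictions that are locally smooth around each input.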