Note-level automatic music transcription is one of the most representative music information retrieval (MIR) tasks and has been studied for various instruments to understand music. However, due to the lack of high-quality labeled data, transcription remains a challenging task for many instruments. In particular, in the case of singing, it is difficult to identify accurate notes because of its expressiveness in pitch, timbre, and dynamics. In this paper, we propose a method for detecting note onsets of the singing voice more accurately by leveraging the linguistic characteristics of singing, which are not found in other instruments. The proposed model takes a mel-scaled spectrogram and a phonetic posteriorgram (PPG), a frame-wise likelihood of phonemes, as inputs to the onset detection network, where the PPG is generated by a network pre-trained on singing and speech data. To verify how linguistic features affect onset detection, we compare evaluation results on datasets in different languages and categorize onset types for a detailed analysis. Our approach substantially improves the performance of singing transcription and thereby highlights the importance of linguistic features in singing analysis.
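The sketch below illustrates the general idea of the input pipeline described above: a mel-scaled spectrogram is concatenated frame by frame with a PPG and fed to a small frame-wise onset detector. It is a minimal illustration rather than the paper's architecture; the feature dimensions, hop size, network layers, and the random placeholder standing in for a PPG produced by a pre-trained phoneme network are all assumptions.

```python
# Minimal sketch (not the paper's exact model): frame-wise onset detection from
# a mel-spectrogram concatenated with a phonetic posteriorgram (PPG).
import numpy as np
import librosa
import torch
import torch.nn as nn

SR, HOP, N_MELS, N_PHONEMES = 16000, 160, 80, 40  # assumed frame/feature settings

# Synthetic 2-second tone in place of real singing audio.
y = librosa.tone(220.0, sr=SR, duration=2.0)

# Mel-scaled spectrogram, log-compressed, shape: (frames, n_mels).
mel = librosa.feature.melspectrogram(y=y, sr=SR, hop_length=HOP, n_mels=N_MELS)
log_mel = librosa.power_to_db(mel).T

# Placeholder PPG: frame-wise phoneme likelihoods, shape: (frames, n_phonemes).
# In the paper this comes from a network pre-trained on singing and speech data.
ppg = np.random.dirichlet(np.ones(N_PHONEMES), size=log_mel.shape[0])

# Concatenate the two feature streams frame by frame.
features = np.concatenate([log_mel, ppg], axis=1).astype(np.float32)

class OnsetDetector(nn.Module):
    """Tiny recurrent onset detector producing a per-frame onset probability."""
    def __init__(self, in_dim):
        super().__init__()
        self.rnn = nn.GRU(in_dim, 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, 1)

    def forward(self, x):                  # x: (batch, frames, in_dim)
        h, _ = self.rnn(x)
        return torch.sigmoid(self.out(h))  # (batch, frames, 1)

model = OnsetDetector(in_dim=features.shape[1])
with torch.no_grad():
    onset_prob = model(torch.from_numpy(features).unsqueeze(0))
print(onset_prob.shape)  # torch.Size([1, frames, 1])
```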