In recent years, the accuracy of automatic lyrics alignment methods has increased considerably. Yet many current approaches employ frameworks designed for automatic speech recognition (ASR) and do not exploit properties specific to music. Pitch is an important musical attribute of the singing voice, but current systems often ignore it because the lyrical content is considered independent of pitch. In practice, however, the two are temporally correlated, since note onsets often coincide with phoneme onsets. At the same time, pitch is usually annotated with high temporal accuracy in ground-truth data, whereas the timing of lyrics is often available only at the line (or word) level. In this paper, we propose a multi-task learning approach for lyrics alignment that incorporates pitch and can thus exploit a new source of highly accurate temporal information. Our results show that this approach indeed improves alignment accuracy. As an additional contribution, we show that integrating boundary detection into the forced-alignment algorithm reduces cross-line errors, which improves accuracy even further.
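
One common way to set up a multi-task objective of this kind, given here only as an illustrative sketch and not as the formulation used in this paper, is a weighted sum of a lyrics loss and a pitch loss computed from a shared network. The symbols $\mathcal{L}_{\text{lyrics}}$, $\mathcal{L}_{\text{pitch}}$ and the weight $\lambda$ are assumptions introduced for illustration:

```latex
% Hypothetical multi-task objective: all symbols are illustrative assumptions,
% not taken from the paper.
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{lyrics}} + \lambda \, \mathcal{L}_{\text{pitch}}, \qquad \lambda \ge 0
```

Under this assumed form, $\lambda = 0$ reduces to a lyrics-only model, while larger values make the shared representation rely more on the frame-level pitch annotations.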