Lyrics recognition is an important task in music processing. Although traditional approaches such as the hybrid HMM-TDNN model achieve good performance, studies applying end-to-end models and self-supervised learning (SSL) remain limited. In this paper, we first establish an end-to-end baseline for lyrics recognition and then explore the performance of SSL models. We evaluate four upstream SSL models, categorized by their training objective: masked reconstruction, masked prediction, autoregressive reconstruction, and contrastive learning. With SSL features, the best system improves on the previous state-of-the-art baseline by 5.23% on the dev set and 2.4% on the test set, even without a language model trained on a large corpus. Moreover, we study the generalization ability of the SSL features, given that these models were not trained on music datasets.
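The four upstream objectives differ in what the model is trained to recover from hidden or past context. As a minimal sketch (not the paper's actual models), the masked-reconstruction idea can be illustrated with a toy predictor in NumPy: a fraction of feature frames is hidden, and the objective scores how well they are reconstructed from the visible frames. Here the "predictor" is simply the mean of the visible frames; a real SSL model would use a learned encoder such as a Transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_reconstruction_loss(features, mask_ratio=0.15):
    """Toy masked-reconstruction objective: hide a fraction of
    frames and measure how well a trivial predictor (the mean of
    the visible frames) reconstructs them. Real SSL models replace
    this predictor with a learned network."""
    num_frames, _ = features.shape
    n_masked = max(1, int(num_frames * mask_ratio))
    masked_idx = rng.choice(num_frames, size=n_masked, replace=False)
    visible = np.delete(features, masked_idx, axis=0)
    # Trivial predictor: one mean vector used for every masked frame.
    prediction = visible.mean(axis=0, keepdims=True)
    target = features[masked_idx]
    # Mean-squared reconstruction error over the masked frames.
    return float(((prediction - target) ** 2).mean())

feats = rng.standard_normal((100, 40))  # 100 frames, 40-dim features
loss = masked_reconstruction_loss(feats)
print(f"masked reconstruction loss: {loss:.3f}")
```

Masked prediction replaces the continuous target with discrete pseudo-labels, autoregressive reconstruction predicts future frames from past ones only, and the contrastive objective distinguishes the true masked frame from distractor frames rather than reconstructing it directly.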