Automatic speech recognition (ASR) has progressed significantly in recent years, driven by large-scale datasets and the self-supervised learning (SSL) paradigm. However, its counterpart in the singing domain, automatic lyric transcription (ALT), suffers from limited data and the degraded intelligibility of sung lyrics, and has therefore developed at a slower pace. To close the performance gap between ALT and ASR, we attempt to exploit the similarities between speech and singing. In this work, we propose a transfer-learning-based ALT solution that takes advantage of these similarities by adapting wav2vec 2.0, an SSL ASR model, to the singing domain. We maximize the effectiveness of transfer learning by exploring the influence of different transfer starting points. We further improve performance by extending the original CTC model to a hybrid CTC/attention model. Our method surpasses previous approaches by a large margin on various ALT benchmark datasets. Further experiments show that, even with a small fraction of the training data, our method still achieves competitive performance.
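The hybrid CTC/attention objective mentioned above is commonly realized as a weighted interpolation of a CTC loss over the encoder outputs and a cross-entropy loss over the attention decoder's predictions. The sketch below is a minimal, generic PyTorch illustration of that interpolation; the class name, the weight value, and the tensor shapes are illustrative assumptions and do not reflect the paper's exact implementation.

```python
import torch
import torch.nn as nn

class HybridCTCAttentionLoss(nn.Module):
    """Weighted sum of a CTC loss and an attention-decoder cross-entropy loss:
    loss = ctc_weight * L_ctc + (1 - ctc_weight) * L_att.
    Hypothetical helper for illustration only."""

    def __init__(self, ctc_weight: float = 0.3, blank_id: int = 0, pad_id: int = -100):
        super().__init__()
        self.ctc_weight = ctc_weight
        self.ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.ce = nn.CrossEntropyLoss(ignore_index=pad_id)

    def forward(self, ctc_log_probs, input_lengths, dec_logits, targets, target_lengths):
        # ctc_log_probs: (T, B, V) log-probabilities from the encoder's CTC head
        # dec_logits:    (B, L, V) logits from the attention decoder
        # targets:       (B, L) token ids, padded with pad_id for the CE branch
        ctc_targets = targets.clamp(min=0)  # CTC ignores entries beyond target_lengths
        l_ctc = self.ctc(ctc_log_probs, ctc_targets, input_lengths, target_lengths)
        l_att = self.ce(dec_logits.transpose(1, 2), targets)  # CE over the vocab dim
        return self.ctc_weight * l_ctc + (1.0 - self.ctc_weight) * l_att
```

In practice the CTC branch regularizes the attention decoder toward monotonic alignments, which is especially helpful when sung lyrics stretch syllables over long durations.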