Automatic speech recognition (ASR) has progressed significantly in recent years due to the emergence of large-scale datasets and the self-supervised learning (SSL) paradigm. However, the development of automatic lyric transcription (ALT), its counterpart problem in the singing domain, suffers from limited data and the degraded intelligibility of sung lyrics. To bridge the performance gap between ALT and ASR, we attempt to exploit the similarities between speech and singing. In this work, we propose a transfer-learning-based ALT solution that takes advantage of these similarities by adapting wav2vec 2.0, an SSL ASR model, to the singing domain. We maximize the effectiveness of transfer learning by exploring the influence of different transfer starting points. We further enhance performance by extending the original CTC model to a hybrid CTC/attention model. Our method surpasses previous approaches by a large margin on various ALT benchmark datasets. Further experiments show that, even with a tiny proportion of the training data, our method still achieves competitive performance.
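For readers unfamiliar with the hybrid CTC/attention formulation mentioned above, the following is a minimal sketch of the standard multi-task objective typically used for such models; the interpolation weight lambda is an assumed hyperparameter, and the abstract does not specify the exact weighting used in this work:

\[
\mathcal{L}_{\mathrm{MTL}} \;=\; \lambda\, \mathcal{L}_{\mathrm{CTC}} \;+\; (1 - \lambda)\, \mathcal{L}_{\mathrm{attention}}, \qquad \lambda \in [0, 1].
\]

Here \(\mathcal{L}_{\mathrm{CTC}}\) denotes the connectionist temporal classification loss and \(\mathcal{L}_{\mathrm{attention}}\) the cross-entropy loss of the attention-based decoder; jointly optimizing both is the standard way to combine the two branches during fine-tuning.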