Current lyrics transcription approaches rely heavily on supervised learning with labeled data, but such data are scarce and manual labeling of singing is expensive. How to benefit from unlabeled data and alleviate the limited-data problem has not been explored for lyrics transcription. We propose the first semi-supervised lyrics transcription paradigm, Self-Transcriber, which leverages unlabeled data through self-training with noisy student augmentation. We attempt to demonstrate the possibility of lyrics transcription with a small amount of labeled data. Self-Transcriber generates pseudo-labels for the unlabeled singing using a teacher model, then combines the pseudo-labeled data with the labeled data to update the student model using both self-training and supervised training losses. This work closes the gap between supervised and semi-supervised learning and opens doors for few-shot learning of lyrics transcription. Our experiments show that our approach, using only 12.7 hours of labeled data, achieves competitive performance compared with supervised approaches trained on 149.1 hours of labeled data for lyrics transcription.
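The teacher–student self-training loop described above can be sketched as follows. This is a minimal toy illustration, not the paper's actual implementation: the 1-D threshold "model", the perceptron-style `train` function, and the jitter-based stand-in for noisy student augmentation are all assumptions chosen to keep the example self-contained.

```python
# Toy sketch of self-training with noisy student augmentation:
# 1) train a teacher on the small labeled set,
# 2) pseudo-label the unlabeled pool with the teacher,
# 3) train a student on labeled + noise-augmented pseudo-labeled data.
# All model/task details here are illustrative, not from the paper.
import random

random.seed(0)

def predict(weight, x):
    # Toy binary model: label 1 if weight * x > 0, else 0.
    return 1 if weight * x > 0 else 0

def train(data, epochs=20, lr=0.1):
    # Perceptron-style updates over (x, y) pairs; stands in for the
    # combined supervised + self-training objective in the paper.
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            if predict(w, x) != y:
                w += lr * (2 * y - 1) * x
    return w

# Small labeled set and a larger unlabeled pool (true rule: y = [x > 0]).
labeled = [(1.0, 1), (-1.0, 0), (0.5, 1)]
unlabeled = [2.0, -2.0, 1.5, -0.5, 3.0, -1.5]

# 1) Train the teacher on labeled data only.
teacher = train(labeled)

# 2) Teacher generates pseudo-labels for the unlabeled pool.
pseudo = [(x, predict(teacher, x)) for x in unlabeled]

# 3) Noisy-student step: perturb the pseudo-labeled inputs (a crude
#    stand-in for augmentation) and train the student on the union of
#    labeled and pseudo-labeled data.
noisy_pseudo = [(x + random.uniform(-0.1, 0.1), y) for x, y in pseudo]
student = train(labeled + noisy_pseudo)

accuracy = sum(predict(student, x) == (1 if x > 0 else 0)
               for x in unlabeled) / len(unlabeled)
print(accuracy)
```

In the paper's setting, `train` would be replaced by lyrics transcription model updates and the jitter by noisy student augmentation of the singing audio; the loop structure (teacher pseudo-labels, student update on the combined data) is the point of the sketch.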