Lyrics transcription of polyphonic music is challenging because the background music degrades lyrics intelligibility. Lyrics transcription is typically performed by a two-step pipeline: a singing-vocal extraction front end followed by a lyrics transcriber back end, with the two trained separately. Such a pipeline suffers from both imperfect vocal extraction and a mismatch between the front end and the back end. In this work, we propose a novel end-to-end integrated training framework, which we call PoLyScriber, that globally optimizes the vocal extractor front end and the lyrics transcriber back end for lyrics transcription in polyphonic music. Experimental results show that the proposed integrated training model achieves substantial improvements over existing approaches on publicly available test datasets.
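To make the contrast between separate and integrated training concrete, the following is a minimal toy sketch, not the PoLyScriber architecture. Two scalar parameters stand in for the vocal extractor front end and the lyrics transcriber back end, and a single combined loss updates both, so the transcription error also reaches the front end. The function names, the numbers, and the loss weight `alpha` are all assumptions made for illustration; the abstract does not specify how the objective is composed.

```python
# Toy illustration only: scalars stand in for the neural front end and back end.
# All names and values here are hypothetical, not taken from the paper.

def joint_step(a, b, x, vocal, target, lr=0.01, alpha=0.5):
    """One gradient step on both stages from a single combined loss.

    a: front-end parameter (stand-in for the vocal extractor)
    b: back-end parameter (stand-in for the lyrics transcriber)
    alpha: assumed weight of the extraction loss term
    """
    s = a * x              # front end: estimated vocal from the mixture x
    y = b * s              # back end: prediction from the estimated vocal
    err_ext = s - vocal    # extraction error
    err_asr = y - target   # transcription error
    loss = alpha * err_ext ** 2 + err_asr ** 2
    # Chain rule: the transcription error reaches the front end through b.
    grad_a = 2 * alpha * err_ext * x + 2 * err_asr * b * x
    grad_b = 2 * err_asr * s
    return a - lr * grad_a, b - lr * grad_b, loss


def train(steps=2000):
    """Run joint training on one made-up example and record the loss."""
    a, b = 0.1, 0.1
    x, vocal, target = 2.0, 1.0, 3.0   # made-up numbers for the demo
    losses = []
    for _ in range(steps):
        a, b, loss = joint_step(a, b, x, vocal, target)
        losses.append(loss)
    return a, b, losses


if __name__ == "__main__":
    a, b, losses = train()
    print(f"a={a:.3f} b={b:.3f} first loss={losses[0]:.3f} last loss={losses[-1]:.3f}")
```

In a separately trained pipeline, `grad_a` would contain only the extraction term, so the front end would never be adjusted to suit the back end. The second term in `grad_a` is what the integrated setup adds in this toy version.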