Existing singing voice synthesis (SVS) models are typically trained on singing data and depend either on error-prone time-alignment and duration features or on explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data, without requiring time-alignments. Karaoker synthesizes singing voice and transfers style following a multi-dimensional template extracted from the source waveform of an unseen singer/speaker. The model is conditioned, via a single deep convolutional encoder, on continuous features including pitch, intensity, harmonicity, formants, cepstral peak prominence (CPP), and octaves. We extend the text-to-speech training objective with feature reconstruction, classification, and speaker identification tasks that guide the model toward an accurate result. Beyond multitasking, we also employ a Wasserstein GAN training scheme, along with new losses on the acoustic model's output, to further refine the quality of the model.
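To illustrate the kind of conditioning inputs the abstract describes, the sketch below extracts several of the named continuous features (pitch, intensity, harmonicity, formants, and a coarse octave index) from a source waveform with the praat-parselmouth library. This is a minimal sketch, not the paper's implementation: the frame settings, the 55 Hz octave reference, and the helper name are assumptions, and CPP extraction is omitted for brevity.

```python
# A minimal sketch (not the authors' code) of extracting some of the
# continuous conditioning features named in the abstract from a source
# waveform, using the praat-parselmouth library.
import numpy as np
import parselmouth


def extract_conditioning_features(wav_path: str, time_step: float = 0.01):
    """Return per-frame feature tracks for conditioning a synthesis model."""
    snd = parselmouth.Sound(wav_path)

    # Fundamental frequency (F0) track; unvoiced frames come out as 0 Hz.
    pitch = snd.to_pitch(time_step=time_step)
    f0 = pitch.selected_array["frequency"]

    # Intensity contour in dB.
    intensity = snd.to_intensity(time_step=time_step).values.squeeze(0)

    # Harmonics-to-noise ratio (harmonicity) in dB; unvoiced frames show
    # up as large negative values.
    harmonicity = snd.to_harmonicity(time_step=time_step).values.squeeze(0)

    # First two formant tracks, sampled on the pitch time axis.
    formant = snd.to_formant_burg(time_step=time_step)
    times = pitch.xs()
    f1 = np.array([formant.get_value_at_time(1, t) for t in times])
    f2 = np.array([formant.get_value_at_time(2, t) for t in times])

    # Coarse octave index relative to an assumed 55 Hz (A1) reference,
    # zeroed on unvoiced frames.
    octave = np.where(f0 > 0, np.floor(np.log2(np.maximum(f0, 1.0) / 55.0)), 0.0)

    # Note: the different Praat analyses use slightly different framing, so
    # in practice the tracks would be resampled to a shared frame rate
    # before being stacked into a multi-dimensional template.
    return {"f0": f0, "intensity": intensity, "harmonicity": harmonicity,
            "f1": f1, "f2": f2, "octave": octave}
```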