In this paper, a text-to-rapping/singing system is introduced, which can be adapted to any speaker's voice. It utilizes a Tacotron-based multispeaker acoustic model trained on read-only speech data and provides prosody control at the phoneme level. Dataset augmentation and additional prosody manipulation based on traditional DSP algorithms are also investigated. The neural TTS model is fine-tuned on an unseen speaker's limited recordings, allowing rapping/singing synthesis in the target speaker's voice. The detailed pipeline of the system is described, including the extraction of the target pitch and duration values from an a cappella song and their conversion into the target speaker's valid range of notes before synthesis. An additional stage of prosodic manipulation of the output via WSOLA is also investigated to better match the target duration values. The synthesized utterances can be mixed with an instrumental accompaniment track to produce a complete song. The proposed system is evaluated via subjective listening tests, as well as in comparison to an available alternative system that also aims to produce synthetic singing voice from read-only training data. Results show that the proposed approach can produce high-quality rapping/singing voice with increased naturalness.
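As an illustration of the note-range conversion step, the following is a minimal sketch (not the paper's implementation) of folding extracted target notes into a speaker's valid range by whole-octave shifts, which preserves pitch class. The function name, the MIDI note encoding, and the example range bounds are assumptions made for this sketch.

```python
def fold_into_range(note: int, lo: int, hi: int) -> int:
    """Shift a MIDI note by octaves (12 semitones) until it lies in [lo, hi].

    Assumes hi - lo >= 11 so that a valid fold always exists.
    """
    while note < lo:
        note += 12
    while note > hi:
        note -= 12
    return note

# Example: an a cappella melody spanning C3..G5 folded into a
# hypothetical speaker range of E3 (MIDI 52) to E4 (MIDI 64).
melody = [48, 55, 60, 67, 79]
print([fold_into_range(n, 52, 64) for n in melody])  # -> [60, 55, 60, 55, 55]
```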
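The WSOLA-based duration-matching stage could likewise be prototyped with an off-the-shelf time-scale modification library. The sketch below uses the open-source audiotsm package rather than the authors' own WSOLA stage; the file names and the duration values used to derive the speed ratio are hypothetical.

```python
from audiotsm import wsola
from audiotsm.io.wav import WavReader, WavWriter

# Hypothetical durations (seconds): the synthesized output vs. the target
# value extracted from the reference a cappella track.
synth_dur, target_dur = 2.4, 3.0
speed = synth_dur / target_dur  # < 1.0 lengthens the audio to match the target

with WavReader("synth.wav") as reader:
    with WavWriter("stretched.wav", reader.channels, reader.samplerate) as writer:
        # WSOLA time-stretches the waveform without altering its pitch.
        tsm = wsola(reader.channels, speed=speed)
        tsm.run(reader, writer)
```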