In this paper, we propose a model for style transfer from speech to singing voice. In contrast to previous signal processing-based methods, which require high-quality singing templates or phoneme synchronization, we explore a data-driven approach to the problem of converting natural speech to singing voice. We develop a novel neural network architecture, called SymNet, which models the alignment of the input speech with the target melody while preserving the speaker identity and naturalness. The proposed SymNet model comprises a symmetric stack of three types of layers: convolutional, transformer, and self-attention layers. The paper also explores novel data augmentation and generative loss annealing methods to facilitate model training. Experiments are performed on the NUS and NHSS datasets, which consist of parallel recordings of speech and singing voice. In these experiments, we show that the proposed SymNet model significantly improves objective reconstruction quality over previously published methods and baseline architectures. Further, a subjective listening test confirms the improved quality of the audio obtained using the proposed approach (an absolute improvement of 0.37 in mean opinion score over the baseline system).
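To make the architecture description concrete, below is a minimal sketch of a symmetric convolutional-transformer-self-attention stack in PyTorch. The layer counts, dimensions, mirroring scheme, and residual connection are illustrative assumptions for a sequence-to-sequence spectral mapping, not the published SymNet configuration.

```python
# Hypothetical sketch of a symmetric layer stack in the spirit of SymNet.
# All hyperparameters (feat_dim, hidden_dim, n_heads, layer counts) are
# illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn

class SymNetSketch(nn.Module):
    def __init__(self, feat_dim=80, hidden_dim=256, n_heads=4):
        super().__init__()
        # Front half: convolution captures local spectral patterns.
        self.conv_in = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Middle: transformer layers model long-range dependencies,
        # e.g. aligning input speech frames with the target melody.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Standalone self-attention layer at the center of the stack.
        self.self_attn = nn.MultiheadAttention(
            hidden_dim, n_heads, batch_first=True)
        # Back half mirrors the front: convolution maps back to feature space.
        self.conv_out = nn.Conv1d(hidden_dim, feat_dim, kernel_size=5, padding=2)

    def forward(self, x):
        # x: (batch, time, feat_dim) spectral features of the input speech.
        h = self.conv_in(x.transpose(1, 2)).transpose(1, 2)
        h = self.transformer(h)
        attn_out, _ = self.self_attn(h, h, h)
        h = h + attn_out  # residual connection around self-attention
        # Output has the same shape as the input, interpreted as
        # spectral features of the converted singing voice.
        return self.conv_out(h.transpose(1, 2)).transpose(1, 2)

model = SymNetSketch()
speech = torch.randn(2, 200, 80)  # batch of 2 utterances, 200 frames each
singing = model(speech)           # same shape as the input features
```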