A Melody-Unsupervision Model for Singing Voice Synthesis

Recent studies in singing voice synthesis have achieved high-quality results by leveraging advances in text-to-speech models based on deep neural networks. One of the main issues in training singing voice synthesis models is that they require melody and lyric labels to be temporally aligned with the audio data. Producing this temporal alignment is time-consuming manual work in preparing the training data. To address the issue, we propose a melody-unsupervision model that requires only audio-and-lyrics pairs without temporal alignment at training time but generates singing voice audio given a melody and lyrics as input at inference time. The proposed model is composed of a phoneme classifier and a singing voice generator jointly trained in an end-to-end manner. The model can be fine-tuned by adjusting the amount of supervision with temporally aligned melody labels. Through experiments in melody-unsupervision and semi-supervision settings, we compare the audio quality of the synthesized singing voice. We also show that the proposed model can be trained with speech audio and text labels and still generate singing voice at inference time.

