In this paper, we propose to unify the two components of voice synthesis, namely text-to-speech (TTS) and the vocoder, into one framework based on a pair of forward and reverse-time linear stochastic differential equations (SDEs). The solutions of this SDE pair are two stochastic processes: one transforms the distribution of the mel spectrogram (or waveform) that we want to generate into a simple and tractable distribution; the other is the generation procedure that turns this tractable, simple signal back into the target mel spectrogram (or waveform). The model that generates mel spectrograms is called It\^oTTS, and the model that generates waveforms is called It\^oWave. It\^oTTS and It\^oWave use the Wiener process as a driver to gradually remove the excess signal from a noise signal, producing realistic and meaningful mel spectrograms and audio respectively, conditioned on the original text or mel spectrogram. Experimental results show that the mean opinion scores (MOS) of It\^oTTS and It\^oWave exceed those of current state-of-the-art methods, reaching 3.925$\pm$0.160 and 4.35$\pm$0.115 respectively. Generated audio samples are available at https://wushoule.github.io/ItoAudio/. All authors contributed equally to this work.
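For reference, the following is a minimal sketch of a generic forward/reverse-time SDE pair of the kind described above, written in the standard score-based form; the specific drift $f$, diffusion $g$, and score network used by It\^oTTS and It\^oWave are not given in this abstract, so these symbols are assumptions.
\begin{align}
\mathrm{d}\mathbf{x} &= f(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}, \\
\mathrm{d}\mathbf{x} &= \big[f(\mathbf{x}, t) - g(t)^{2}\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x} \mid c)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}},
\end{align}
where the first equation is the forward process that carries the data (mel spectrogram or waveform) to a tractable distribution, the second is the reverse-time generation process, $\mathbf{w}$ and $\bar{\mathbf{w}}$ are forward and reverse-time Wiener processes, $c$ denotes the conditioning input (text or mel spectrogram), and $\nabla_{\mathbf{x}} \log p_t(\mathbf{x} \mid c)$ is the conditional score that a neural network is trained to estimate.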