In this paper, we propose to unify the two stages of voice synthesis, namely text-to-speech (TTS) and vocoding, in one framework based on a pair of forward and reverse-time linear stochastic differential equations (SDEs). The solutions of this SDE pair are two stochastic processes. One turns the distribution of the mel spectrogram (or waveform) that we want to generate into a simple, tractable distribution. The other is the generation procedure, which turns a sample from this simple, tractable distribution into the target mel spectrogram (or waveform). The model that generates mel spectrograms is called It$\hat{\text{o}}$TTS, and the model that generates waveforms is called It$\hat{\text{o}}$Wave. It$\hat{\text{o}}$TTS and It$\hat{\text{o}}$Wave use the Wiener process as a driver and gradually subtract the excess signal from a noise signal to generate realistic mel spectrograms and audio, respectively, conditioned on the original text or on the mel spectrogram. Experimental results show that the mean opinion scores (MOS) of It$\hat{\text{o}}$TTS and It$\hat{\text{o}}$Wave can exceed those of current state-of-the-art methods, reaching 3.925$\pm$0.160 and 4.35$\pm$0.115, respectively. The generated audio samples are available at https://shiziqiang.github.io/ito_audio/. All authors contributed equally to this work.
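To make the forward and reverse-time SDE pair concrete, the following is a minimal one-dimensional sketch and not the It$\hat{\text{o}}$TTS or It$\hat{\text{o}}$Wave implementation. It assumes a particular linear forward SDE, $dx = -\tfrac{1}{2}\beta x\,dt + \sqrt{\beta}\,dw$, with a constant $\beta$; the abstract does not specify the drift or diffusion coefficients. It also assumes a toy Gaussian "data" distribution, so that the score is available in closed form and stands in for the trained, text- or mel-conditioned score network. All names (`BETA`, `marginal`, `score`, `forward_sample`, `reverse_sample`) are hypothetical.

```python
import numpy as np

# All constants below are illustrative assumptions, not values from the paper.
BETA = 10.0                      # constant noise rate of the linear forward SDE
T = 1.0                          # final diffusion time
DATA_MEAN, DATA_STD = 2.0, 0.5   # toy "data" distribution N(2, 0.5^2)


def marginal(t):
    """Mean and variance of x(t) under the forward SDE for Gaussian x(0)."""
    decay = np.exp(-BETA * t)
    mean = DATA_MEAN * np.sqrt(decay)
    var = DATA_STD ** 2 * decay + (1.0 - decay)
    return mean, var


def score(x, t):
    """Analytic score d/dx log p_t(x); a stand-in for a learned score network."""
    mean, var = marginal(t)
    return -(x - mean) / var


def forward_sample(n, rng):
    """Forward process: map data samples to (almost) standard Gaussian noise."""
    x0 = rng.normal(DATA_MEAN, DATA_STD, size=n)
    decay = np.exp(-BETA * T)
    # Closed-form transition of the linear SDE from time 0 to time T.
    return x0 * np.sqrt(decay) + np.sqrt(1.0 - decay) * rng.normal(size=n)


def reverse_sample(n, steps, rng):
    """Reverse-time SDE integrated from t = T down to t = 0 (Euler-Maruyama)."""
    dt = T / steps
    x = rng.normal(size=n)  # start from the simple, tractable distribution
    for i in range(steps):
        t = T - i * dt
        # Reverse-time drift: f(x, t) - g(t)^2 * score(x, t), with g^2 = BETA.
        drift = -0.5 * BETA * x - BETA * score(x, t)
        # Step backward in time, driven by a fresh Wiener increment.
        x = x - drift * dt + np.sqrt(BETA * dt) * rng.normal(size=n)
    return x
```

Calling `reverse_sample(20000, 500, np.random.default_rng(0))` should return samples whose mean and standard deviation are close to `DATA_MEAN` and `DATA_STD`, up to discretization and sampling error. In the setting described above, the analytic `score` would be replaced by a network conditioned on the text (for the mel spectrogram) or on the mel spectrogram (for the waveform).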