Previous pitch-controllable text-to-speech (TTS) models rely on directly modeling fundamental frequency, leading to low variance in synthesized speech. To address this issue, we propose PITS, an end-to-end pitch-controllable TTS model that utilizes variational inference to model pitch. Based on VITS, PITS incorporates the Yingram encoder, the Yingram decoder, and adversarial training of pitch-shifted synthesis to achieve pitch-controllability. Experiments demonstrate that PITS generates high-quality speech that is indistinguishable from ground truth speech and has high pitch-controllability without quality degradation. Code and audio samples will be available at https://github.com/anonymous-pits/pits.