Several fully end-to-end text-to-speech (TTS) models have been proposed that show better performance than cascade models (i.e., models that train the acoustic and vocoder components separately). However, they often generate unstable pitch contours with audible artifacts when the dataset contains emotional attributes, i.e., a large diversity of pronunciation and prosody. To address this problem, we propose Period VITS, a novel end-to-end TTS model that incorporates an explicit periodicity generator. In the proposed method, we introduce a frame pitch predictor that predicts prosodic features, such as pitch and voicing flags, from the input text. From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch. Finally, the entire model is jointly optimized in an end-to-end manner with variational inference and adversarial objectives. As a result, the decoder becomes capable of generating more stable, expressive, and natural output waveforms. The experimental results show that the proposed model significantly outperforms baseline models in terms of naturalness, with improved pitch stability in the generated samples.
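To make the periodicity generator concrete, the sketch below shows one common way to turn frame-level pitch and voicing flags into a sample-level sinusoidal source: upsample F0 to the sample rate, integrate the instantaneous frequency into a phase, and emit a sine in voiced regions plus low-level noise. This is a minimal illustration under assumed hop size, sample rate, and noise level, not the paper's actual implementation (which is a trainable module optimized jointly with the decoder); all function and parameter names here are hypothetical.

```python
import math
import random

def sinusoidal_source(f0_frames, voiced_flags, frame_hop=256,
                      sample_rate=22050, noise_std=0.003):
    """Synthesize a sample-level excitation from frame-level prosodic features.

    f0_frames: per-frame fundamental frequency in Hz (hypothetical input).
    voiced_flags: per-frame booleans; unvoiced frames fall back to noise only.
    """
    phase = 0.0
    samples = []
    for f0, voiced in zip(f0_frames, voiced_flags):
        for _ in range(frame_hop):
            # Integrate instantaneous frequency so pitch changes stay phase-continuous.
            phase += 2.0 * math.pi * f0 / sample_rate
            sine = math.sin(phase) if voiced else 0.0
            # Additive noise gives the decoder an aperiodic component to shape.
            samples.append(sine + random.gauss(0.0, noise_std))
    return samples
```

A decoder conditioned on such a source no longer has to infer periodicity from scratch, which is the intuition behind the improved pitch stability reported in the abstract.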