This paper introduces Parallel Tacotron 2, a non-autoregressive neural text-to-speech model with a fully differentiable duration model that does not require supervised duration signals. Built on a novel attention mechanism and an iterative reconstruction loss based on Soft Dynamic Time Warping, the duration model learns token-frame alignments as well as token durations automatically. Experimental results show that Parallel Tacotron 2 outperforms baselines in subjective naturalness across several diverse multi-speaker evaluations. Its duration control capability is also demonstrated.
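The Soft Dynamic Time Warping loss mentioned above can be illustrated with the standard soft-DTW recurrence (Cuturi & Blondel, 2017), in which the hard `min` of dynamic time warping is replaced by a differentiable soft-min so that gradients can flow through the alignment. This is a minimal generic sketch, not the paper's exact implementation; the function name, the `gamma` default, and the use of a precomputed pairwise cost matrix are all illustrative assumptions.

```python
import numpy as np

def soft_dtw(D, gamma=1.0):
    """Soft-DTW alignment cost for a pairwise cost matrix D (n x m).

    Replaces the hard min of classic DTW with a soft-min
    (-gamma * logsumexp(-x / gamma)), making the whole recurrence
    differentiable. As gamma -> 0, this approaches hard DTW.
    Illustrative sketch only, not the paper's implementation.
    """
    n, m = D.shape
    R = np.full((n + 1, m + 1), np.inf)  # accumulated-cost table
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The three DTW predecessors: insertion, deletion, match.
            prev = np.array([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]])
            # Numerically stable soft-min via the log-sum-exp trick.
            z = -prev / gamma
            zmax = z.max()
            softmin = -gamma * (zmax + np.log(np.exp(z - zmax).sum()))
            R[i, j] = D[i - 1, j - 1] + softmin
    return R[n, m]
```

In a text-to-speech setting, `D` would hold frame-wise distances between predicted and target spectrogram sequences; because the soft-min is smooth, the loss can train the duration model end to end without supervised alignments.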