This paper presents Non-Attentive Tacotron based on the Tacotron 2 text-to-speech model, replacing the attention mechanism with an explicit duration predictor. This improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron achieves a 5-scale mean opinion score for naturalness of 4.41, slightly outperforming Tacotron 2. The duration predictor enables both utterance-wide and per-phoneme control of duration at inference time. When accurate target durations are scarce or unavailable in the training data, we propose a method using a fine-grained variational auto-encoder to train the duration predictor in a semi-supervised or unsupervised manner, with results almost as good as supervised training.
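The Gaussian upsampling mentioned above can be sketched as follows: each phoneme encoding is spread over output frames with weights from a Gaussian centered at the midpoint of that phoneme's predicted duration span. This is a minimal NumPy illustration, not the paper's implementation; the function name, the use of a fixed per-phoneme `sigma`, and the frame-position offset of 0.5 are assumptions for this sketch (in the paper, the ranges are predicted by the network).

```python
import numpy as np

def gaussian_upsample(h, durations, sigma):
    """Sketch of Gaussian upsampling (hypothetical helper, not the paper's code).

    h:         (N, D) per-phoneme encoder outputs
    durations: (N,)   predicted duration of each phoneme, in frames
    sigma:     (N,)   assumed per-phoneme Gaussian width (predicted in the paper)
    returns:   (T, D) upsampled frame-level features, T = sum(durations)
    """
    ends = np.cumsum(durations)           # cumulative end position of each phoneme
    centers = ends - durations / 2.0      # Gaussian center = midpoint of each span
    T = int(ends[-1])
    t = np.arange(T) + 0.5                # frame positions (assumed half-frame offset)
    # Weight of phoneme i at frame t, normalized across phonemes per frame
    logits = -0.5 * ((t[:, None] - centers[None, :]) / sigma[None, :]) ** 2
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ h                          # soft, differentiable expansion
```

Because the weights are a smooth function of the durations, gradients flow through the upsampling step, which is what allows the duration predictor to be trained jointly with the rest of the model.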