Although recent neural text-to-speech (TTS) systems have achieved high-quality speech synthesis, they can still generate low-quality speech, mainly due to limited training data or information loss during knowledge distillation. Therefore, we propose a novel method to improve speech quality by training a TTS model under the supervision of a perceptual loss, which measures the distance between the maximum possible speech quality score and the predicted one. We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model to maximize the MOS of synthesized speech using the pre-trained MOS prediction model. The proposed method can be applied universally (i.e., regardless of the TTS model architecture or the cause of speech quality degradation) and efficiently (i.e., without increasing the inference time or model complexity). The evaluation results for the MOS and phone error rate demonstrate that our proposed approach improves upon previous models in terms of both naturalness and intelligibility.
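The perceptual loss described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the squared distance, the `weight` hyperparameter, and the function names are assumptions; the abstract only specifies that the loss measures the distance between the maximum possible quality score and the one predicted by the frozen, pre-trained MOS model (MOS is conventionally on a 1–5 scale, so the maximum is taken as 5.0).

```python
# Hedged sketch of a perceptual loss driven by a pre-trained MOS predictor.
# Assumptions (not stated in the abstract): squared distance as the metric,
# an additive combination with the ordinary TTS loss, and a scalar weight.

MOS_MAX = 5.0  # maximum possible MOS on the standard 1-5 scale


def perceptual_loss(predicted_mos: float) -> float:
    """Distance between the best achievable MOS and the MOS predicted
    for a synthesized utterance by the frozen MOS prediction model."""
    return (MOS_MAX - predicted_mos) ** 2


def total_loss(tts_loss: float, predicted_mos: float, weight: float = 1.0) -> float:
    """Hypothetical combined objective: the usual TTS training loss
    (e.g., spectrogram reconstruction) plus the weighted perceptual term.
    Minimizing this pushes the TTS model toward higher predicted MOS."""
    return tts_loss + weight * perceptual_loss(predicted_mos)
```

Because the MOS predictor is pre-trained and frozen, gradients flow through it only to update the TTS model, which is why the method adds no inference-time cost.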