Although recent end-to-end text-to-speech (TTS) systems have achieved high-quality synthesized speech, several factors still degrade synthesis quality, including a lack of training data and information loss during knowledge distillation. To address this problem, we propose a novel way to train a TTS model under the supervision of a perceptual loss, which measures the distance between the maximum speech quality score and the predicted one. We first pre-train a mean opinion score (MOS) prediction model and then train a TTS model in the direction of maximizing the MOS of synthesized speech as predicted by the pre-trained MOS prediction model. Through this method, we can improve the quality of synthesized speech universally (i.e., regardless of the network architecture or the cause of the speech quality degradation) and efficiently (i.e., without increasing the inference time or the model complexity). Evaluation results for MOS and phoneme error rate demonstrate that our proposed approach improves upon previous models in terms of both naturalness and intelligibility.
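As a minimal sketch of the perceptual loss described above: the abstract only states that the loss measures the distance between the maximum quality score and the predicted one, so the squared-distance form, the 1-to-5 MOS scale, and the `predict_mos` name below are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of the perceptual loss, assuming a squared distance and a
# MOS scale from 1 to 5. `perceptual_loss` and its inputs are
# illustrative names, not the paper's actual implementation.

MOS_MAX = 5.0  # highest score on the mean opinion score scale (assumed)

def perceptual_loss(predicted_mos: float) -> float:
    """Distance between the maximum MOS and the MOS predicted by the
    pre-trained MOS prediction model for a synthesized utterance.

    Minimizing this term during TTS training pushes the synthesized
    speech toward the highest predicted quality score.
    """
    return (MOS_MAX - predicted_mos) ** 2

# Example: an utterance the MOS predictor scores at 3.8
loss = perceptual_loss(3.8)
```

In practice this term would be computed on batches of synthesized spectrograms and backpropagated through the (frozen) MOS predictor into the TTS model, which is what lets the method work without adding any inference-time cost.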