In this paper, we present a new objective model for predicting the naturalness of synthetic speech. It can be used to evaluate Text-To-Speech (TTS) or Voice Conversion systems and is language-independent. The model is trained end-to-end and is based on a CNN-LSTM network that has previously been shown to give good results for speech quality estimation. We trained and tested the model on 16 different datasets, including data from the Blizzard Challenge and the Voice Conversion Challenge. Further, we show that the reliability of deep learning-based naturalness prediction can be improved by transfer learning from speech quality prediction models trained on objective POLQA scores. The proposed model is publicly available and can, for example, be used to evaluate different TTS system configurations.
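As an illustration of the last use case, the following minimal sketch shows how per-utterance naturalness predictions might be aggregated to compare TTS system configurations. It does not call the proposed model: the function `rank_systems`, the configuration names, and the score values are hypothetical placeholders chosen for illustration, and averaging per system is an assumed aggregation, not one prescribed above.

```python
from statistics import mean


def rank_systems(utterance_scores):
    """Average per-utterance predicted naturalness per system; rank best first.

    `utterance_scores` maps a system name to a list of predicted scores.
    Systems without any scores are skipped.
    """
    system_means = {
        name: mean(scores) for name, scores in utterance_scores.items() if scores
    }
    return sorted(system_means.items(), key=lambda item: item[1], reverse=True)


# Placeholder values for illustration only; NOT outputs of the proposed model.
predicted = {
    "tts_config_a": [3.0, 4.0, 3.5],
    "tts_config_b": [4.0, 4.5, 5.0],
}

ranking = rank_systems(predicted)
for name, score in ranking:
    print(f"{name}: mean predicted naturalness {score:.2f}")
```

In practice, the placeholder lists would be replaced by the predictions obtained for the same set of test sentences synthesized with each configuration, so that the per-system means are comparable.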