Several studies have proposed deep-learning-based models to predict the mean opinion score (MOS) of synthesized speech, showing the possibility of replacing human raters. However, inter- and intra-rater variability in MOS ratings makes it difficult to ensure high performance from these models. In this paper, we propose a multi-task learning (MTL) method that improves the performance of a MOS prediction model using two auxiliary tasks: spoofing detection (SD) and spoofing type classification (STC). In addition, we use the focal loss to maximize the synergy between SD and STC for MOS prediction. Experiments using the MOS evaluation results of the Voice Conversion Challenge 2018 show that the proposed MTL with two auxiliary tasks improves MOS prediction. Our proposed model achieves up to 11.6% relative improvement in performance over the baseline model.
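To make the multi-task objective described in the abstract more concrete, the following is a minimal sketch of how a MOS regression term might be combined with focal-loss terms for the two auxiliary tasks. The abstract does not give the exact formulation, so several details here are assumptions: the squared-error MOS term, the task weights `w_sd` and `w_stc`, the focusing parameter `gamma`, and the choice to apply the focal loss to both the SD and STC heads. The function names `focal_loss` and `mtl_loss` are hypothetical and do not come from the paper.

```python
import math


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t).

    probs  : list of class probabilities (assumed to sum to 1)
    target : index of the true class
    gamma  : focusing parameter; gamma = 0 reduces to cross-entropy
    """
    # Clamp to avoid log(0) for numerically degenerate predictions.
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def mtl_loss(mos_pred, mos_true, sd_probs, sd_label, stc_probs, stc_label,
             w_sd=1.0, w_stc=1.0, gamma=2.0):
    """Hypothetical combined objective for one utterance.

    Main task : MOS regression, squared error (an assumption).
    Auxiliary : spoofing detection (SD) and spoofing type classification
                (STC), each scored with the focal loss (an assumption).
    """
    mos_term = (mos_pred - mos_true) ** 2
    sd_term = focal_loss(sd_probs, sd_label, gamma)
    stc_term = focal_loss(stc_probs, stc_label, gamma)
    return mos_term + w_sd * sd_term + w_stc * stc_term


if __name__ == "__main__":
    # One utterance: predicted MOS 3.5 vs. rated 3.0, a binary SD head,
    # and a three-way STC head (the number of spoofing types is illustrative).
    loss = mtl_loss(3.5, 3.0, [0.1, 0.9], 1, [0.2, 0.7, 0.1], 1)
    print(f"combined loss: {loss:.4f}")
```

The focal loss down-weights examples the classifier already gets right (large `p_t`), so the auxiliary heads contribute most of their gradient on hard examples. In a real system the weights and `gamma` would be tuned on held-out data.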