Current state-of-the-art methods for automatic evaluation of synthetic speech are based on neural MOS prediction models. These include MOSNet and LDNet, which take spectral features as input, and SSL-MOS, which relies on a pretrained self-supervised learning model that operates directly on the speech signal. In modern high-quality neural TTS systems, prosodic appropriateness with regard to the spoken content is a decisive factor for speech naturalness. For this reason, we propose including prosodic and linguistic features as additional inputs to MOS prediction systems, and we evaluate their impact on the prediction outcome. We consider phoneme-level F0 and duration features as prosodic inputs, and Tacotron encoder outputs, POS tags, and BERT embeddings as higher-level linguistic inputs. All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations. Results show that the proposed additional features benefit the MOS prediction task, improving the correlation of predicted MOS scores with the ground truth at both the utterance level and the system level.
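To make the two evaluation granularities concrete, the following minimal sketch computes the correlation between predicted and ground-truth MOS at the utterance level and at the system level. It rests on assumptions not stated above: Pearson correlation is used as the coefficient, system-level scores are taken as per-system means of utterance scores, and the records, system names, and helper functions (`pearson`, `system_means`) are hypothetical toy values chosen for illustration only, not results from SOMOS.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def system_means(records):
    """Average true and predicted MOS per system.

    records: iterable of (system_id, true_mos, predicted_mos) tuples.
    Returns two lists (true means, predicted means) in matching order.
    """
    grouped = defaultdict(lambda: ([], []))
    for system_id, true_mos, pred_mos in records:
        grouped[system_id][0].append(true_mos)
        grouped[system_id][1].append(pred_mos)
    true_means = [sum(t) / len(t) for t, _ in grouped.values()]
    pred_means = [sum(p) / len(p) for _, p in grouped.values()]
    return true_means, pred_means


# Hypothetical toy data: (system_id, ground-truth MOS, predicted MOS).
records = [
    ("sysA", 3.0, 3.2), ("sysA", 4.0, 3.6),
    ("sysB", 2.0, 2.5), ("sysB", 3.0, 2.3),
    ("sysC", 4.5, 4.1), ("sysC", 4.0, 4.4),
]

# Utterance level: correlate every utterance's prediction with its label.
utt_r = pearson([r[1] for r in records], [r[2] for r in records])

# System level: correlate per-system mean predictions with mean labels.
true_means, pred_means = system_means(records)
sys_r = pearson(true_means, pred_means)

print(f"utterance-level r = {utt_r:.3f}")
print(f"system-level r    = {sys_r:.3f}")
```

Averaging over utterances smooths out per-utterance prediction noise, which is why system-level correlation is typically reported separately from utterance-level correlation.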