Mean opinion score (MOS) is a standard subjective evaluation metric for speech synthesis systems. Because collecting MOS is time-consuming, accurate MOS prediction models for automatic evaluation are desirable. In this work, we propose DDOS, a novel MOS prediction model. DDOS uses domain-adaptive pre-training to further pre-train self-supervised learning models on synthetic speech. It also adds a module that models the opinion score distribution of each utterance. With these components, DDOS outperforms previous work on the BVCC dataset and significantly improves zero-shot transfer results on the BC2019 dataset. DDOS also placed second in the Interspeech 2022 VoiceMOS Challenge in terms of system-level score.
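To make the idea of modeling a per-utterance opinion score distribution concrete, the following is a minimal sketch and not the DDOS implementation. It assumes a 5-point opinion scale and a hypothetical model head that emits one logit per score value; neither detail is stated in the abstract. The sketch turns the logits into a distribution, takes its expected value as the predicted utterance-level MOS, and averages utterance predictions per system, which is one common way a system-level score is obtained. The names `expected_mos` and `system_level_scores` are illustrative only.

```python
import math
from collections import defaultdict

# Assumption: opinions are given on a 5-point scale (1 = worst, 5 = best).
SCORES = (1, 2, 3, 4, 5)


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_mos(logits):
    """Predicted MOS for one utterance: the mean of the predicted score distribution."""
    probs = softmax(logits)
    return sum(score * p for score, p in zip(SCORES, probs))


def system_level_scores(utterances):
    """Average utterance-level predictions per system.

    `utterances` is an iterable of (system_id, logits) pairs, where `logits`
    holds one hypothetical model output per score value in SCORES.
    """
    by_system = defaultdict(list)
    for system_id, logits in utterances:
        by_system[system_id].append(expected_mos(logits))
    return {sid: sum(vals) / len(vals) for sid, vals in by_system.items()}


if __name__ == "__main__":
    demo = [
        ("sys_a", [0.0, 0.0, 0.0, 0.0, 0.0]),  # uniform distribution over scores
        ("sys_a", [0.0, 0.0, 2.0, 0.0, 0.0]),  # mass concentrated on score 3
        ("sys_b", [0.0, 0.0, 0.0, 1.0, 3.0]),  # mass concentrated on scores 4 and 5
    ]
    for system_id, score in sorted(system_level_scores(demo).items()):
        print(f"{system_id}: {score:.3f}")
```

Predicting a distribution rather than a single number keeps information about listener disagreement, while the expected value still yields a scalar that can be compared against collected MOS.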