An effective approach to automatically predicting the subjective rating of synthetic speech is to train on a listening test dataset with human-annotated scores. Although each speech sample in such a dataset is rated by several listeners, most previous works used only the mean score as the training target. In this work, we present LDNet, a unified framework for mean opinion score (MOS) prediction that predicts the listener-wise perceived quality given the input speech and the listener identity. We incorporate recent advances in listener-dependent (LD) modeling, including design choices of the model architecture, and propose two inference methods that provide more stable results and efficient computation. We conduct systematic experiments on the voice conversion challenge (VCC) 2018 benchmark and a newly collected large-scale MOS dataset, providing an in-depth analysis of the proposed framework. Results show that the mean listener inference method is a better way to utilize the mean scores, and its effectiveness becomes more apparent when more ratings per sample are available.