Non-reference speech quality models are important for a growing number of applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels. This study examines how much of the variance in subjective speech quality ratings can be explained by metadata and by the distribution imbalances of the dataset. Speech quality models were constructed using wav2vec 2.0 with additional metadata features, including rater groups and system identifiers, and achieved competitive results: a system-level Spearman rank correlation coefficient (SRCC) of 0.934 with a mean squared error (MSE) of 0.088, and an utterance-level SRCC of 0.877 with an MSE of 0.198. Using data and metadata that were restricted or blinded in the test set further improved the metrics. A metadata analysis showed that the system-level metrics do not represent the model's system-level predictions because the number of utterances per system varies widely in the validation and test datasets. We conclude that, in general, each condition should have enough utterances in the test set to bound the error of its sample mean, and utterance counts should be relatively balanced across systems; otherwise, the utterance-level metrics may be more reliable and interpretable.
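As a minimal sketch of the kind of model described above, the following combines a pretrained wav2vec 2.0 backbone with learned embeddings for the metadata features (system identifier and rater group). The embedding sizes, pooling strategy, and head widths are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a wav2vec 2.0 MOS predictor with metadata embeddings.
# Hypothetical configuration: embedding dims, pooling, and head widths
# are illustrative, not the authors' exact architecture.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class MosPredictor(nn.Module):
    def __init__(self, n_systems: int, n_rater_groups: int, meta_dim: int = 32):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        hidden = self.backbone.config.hidden_size  # 768 for the base model
        # Metadata features: learned embeddings for system ID and rater group.
        self.system_emb = nn.Embedding(n_systems, meta_dim)
        self.rater_emb = nn.Embedding(n_rater_groups, meta_dim)
        self.head = nn.Sequential(
            nn.Linear(hidden + 2 * meta_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),  # scalar MOS estimate
        )

    def forward(self, waveform: torch.Tensor, system_id: torch.Tensor,
                rater_group: torch.Tensor) -> torch.Tensor:
        # Mean-pool frame-level wav2vec 2.0 features into one utterance vector.
        frames = self.backbone(waveform).last_hidden_state
        utt = frames.mean(dim=1)
        meta = torch.cat([self.system_emb(system_id),
                          self.rater_emb(rater_group)], dim=-1)
        return self.head(torch.cat([utt, meta], dim=-1)).squeeze(-1)
```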
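The distinction between utterance-level and system-level metrics can be made concrete with a short scoring sketch, assuming arrays of per-utterance predictions, true MOS labels, and system identifiers. System-level scores are means over each system's utterances, so a system represented by only a few utterances contributes a noisy mean yet counts equally in the correlation.

```python
# Sketch of utterance-level vs. system-level SRCC and MSE scoring.
# Assumes parallel arrays of predictions, labels, and system IDs.
import numpy as np
from scipy.stats import spearmanr

def utterance_and_system_metrics(pred, true, system_ids):
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    system_ids = np.asarray(system_ids)
    utt_srcc = spearmanr(pred, true).correlation
    utt_mse = np.mean((pred - true) ** 2)
    # System-level scores average per system; systems with few utterances
    # contribute noisy means but count equally in the rank correlation.
    systems = np.unique(system_ids)
    sys_pred = np.array([pred[system_ids == s].mean() for s in systems])
    sys_true = np.array([true[system_ids == s].mean() for s in systems])
    sys_srcc = spearmanr(sys_pred, sys_true).correlation
    sys_mse = np.mean((sys_pred - sys_true) ** 2)
    return utt_srcc, utt_mse, sys_srcc, sys_mse
```

This also illustrates the closing recommendation: the standard error of a per-system sample mean shrinks roughly as 1/sqrt(n) with the number of utterances n, so systems with too few utterances cannot have their mean error tightly bounded, and the utterance-level metrics become the more interpretable summary.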