Automatic methods for predicting the Mean Opinion Score (MOS) given by listeners have been studied as a way to assure the quality of Text-to-Speech systems. Many previous studies focus on architectural advances (e.g., MBNet and LDNet) that capture the relation between spectral features and MOS more effectively, and they have achieved high accuracy. However, the optimal input representation in terms of generalization capability remains largely unknown. To this end, we compare the performance of Self-Supervised Learning (SSL) features obtained with the wav2vec framework against that of spectral features such as the magnitude spectrogram and the mel-spectrogram. Moreover, we propose combining the SSL features with features that we believe retain information essential to automatic MOS prediction, so that each compensates for the other's drawbacks. We conduct comprehensive experiments on a large-scale listening test corpus collected from past Blizzard Challenges and Voice Conversion Challenges. We found that the wav2vec feature set showed the best generalization even though the given ground truth was not always reliable. Furthermore, we found that the combinations performed best, and we analyzed how they bridged the gap between the spectral and wav2vec feature sets.
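The abstract does not state how the SSL and spectral features are combined. As a minimal sketch only, one plausible scheme is frame-level concatenation: resample the spectral frames to the SSL frame count, concatenate the two vectors at each frame, and average over time to obtain one utterance-level vector for a downstream MOS predictor. The fusion scheme, the function names (`resample_frames`, `combine_features`, `mean_pool`), and the toy values are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List

Frames = List[List[float]]  # one feature vector per time frame


def resample_frames(frames: Frames, target_len: int) -> Frames:
    """Pick frames at evenly spaced positions so the result has target_len frames."""
    if not frames or target_len <= 0:
        raise ValueError("need at least one frame and a positive target length")
    n = len(frames)
    return [frames[min(n - 1, (i * n) // target_len)] for i in range(target_len)]


def combine_features(ssl_frames: Frames, spec_frames: Frames) -> Frames:
    """Concatenate SSL and spectral vectors frame by frame (assumed fusion scheme)."""
    aligned = resample_frames(spec_frames, len(ssl_frames))
    return [s + a for s, a in zip(ssl_frames, aligned)]


def mean_pool(frames: Frames) -> List[float]:
    """Average over time to get a single utterance-level vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


# Toy stand-ins (hypothetical values): 2 SSL frames of dimension 3,
# 4 spectral frames of dimension 2. Real features would come from a
# wav2vec model and a spectrogram front end.
ssl = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
spec = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

fused = combine_features(ssl, spec)
utterance_vector = mean_pool(fused)
print(len(fused), len(fused[0]), utterance_vector)
```

Concatenation keeps both views of the signal intact and leaves it to the predictor to weight them. Other fusion schemes, such as separate encoders with late fusion, would fit the abstract's description equally well.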