We present the UTokyo-SaruLab mean opinion score (MOS) prediction system submitted to the VoiceMOS Challenge 2022. The task is to predict the MOS values of speech samples collected from previous Blizzard Challenges and Voice Conversion Challenges in two tracks: a main track for in-domain prediction and an out-of-domain (OOD) track, for which less labeled data, drawn from different listening tests, is available. Our system is based on ensemble learning of strong and weak learners. The strong learners incorporate several improvements over previous models that fine-tune self-supervised learning (SSL) models, while the weak learners use basic machine-learning methods to predict scores from SSL features. In the challenge, our system achieved the highest score on several metrics in both the main and OOD tracks. In addition, we conducted ablation studies to investigate the effectiveness of the proposed methods.
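The idea of combining several score predictors can be illustrated with a minimal stacking sketch. This is not the submitted system: stacking is only one common form of ensemble learning and is an assumption here, as are all names below. Random vectors stand in for SSL features, the two feature slices stand in for two different SSL models, and ridge regression stands in for the "basic machine-learning methods" of the weak learners. Predictions from fine-tuned strong learners would simply enter as further columns of the meta-learner's input.

```python
import numpy as np


def add_bias(X):
    """Append a constant column so the linear model can learn an offset."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the bias term is left unregularized."""
    Xb = add_bias(X)
    reg = alpha * np.eye(Xb.shape[1])
    reg[-1, -1] = 0.0
    return np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)


def predict_ridge(w, X):
    """Apply a weight vector returned by fit_ridge."""
    return add_bias(X) @ w


def out_of_fold(X, y, n_folds=5, alpha=1.0):
    """Predict each training sample with a model that never saw it."""
    idx = np.arange(len(y))
    oof = np.empty(len(y))
    for held_out in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, held_out)
        w = fit_ridge(X[train], y[train], alpha)
        oof[held_out] = predict_ridge(w, X[held_out])
    return oof


def mse(pred, target):
    return float(np.mean((pred - target) ** 2))


# Synthetic stand-in data (assumption): 16-dim "SSL features", MOS-like targets near 3.
rng = np.random.default_rng(0)
n_train, n_test, dim = 160, 40, 16
X = rng.normal(size=(n_train + n_test, dim))
y = 3.0 + 0.15 * X @ np.linspace(0.5, 1.5, dim)
y = y + rng.normal(scale=0.1, size=n_train + n_test)
X_tr, X_te, y_tr, y_te = X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Each weak learner sees only one feature slice (hypothetical "SSL model A/B").
views = [slice(0, 8), slice(8, 16)]

# Meta-learner inputs: out-of-fold predictions on train, full-fit predictions on test.
meta_train = np.column_stack([out_of_fold(X_tr[:, v], y_tr) for v in views])
meta_test = np.column_stack(
    [predict_ridge(fit_ridge(X_tr[:, v], y_tr), X_te[:, v]) for v in views]
)

# The meta-learner combines the base predictions; outputs are clipped to the MOS range.
meta_w = fit_ridge(meta_train, y_tr, alpha=1e-3)
stacked = np.clip(predict_ridge(meta_w, meta_test), 1.0, 5.0)

mse_a = mse(meta_test[:, 0], y_te)
mse_b = mse(meta_test[:, 1], y_te)
mse_stack = mse(stacked, y_te)
print(f"weak A: {mse_a:.3f}  weak B: {mse_b:.3f}  stacked: {mse_stack:.3f}")
```

Training the meta-learner on out-of-fold predictions rather than in-sample predictions keeps it from rewarding base learners that merely overfit the training set, which matters most when labeled data is scarce, as in the OOD track.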