We present the first edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthetic speech. This challenge drew 22 participating teams from academia and industry who tried a variety of approaches to tackle the problem of predicting human ratings of synthesized speech. The listening test data for the main track of the challenge consisted of samples from 187 different text-to-speech and voice conversion systems spanning over a decade of research, and the out-of-domain track consisted of data from more recent systems rated in a separate listening test. Results of the challenge show the effectiveness of fine-tuning self-supervised speech models for the MOS prediction task, as well as the difficulty of predicting MOS ratings for unseen speakers and listeners, and for unseen systems in the out-of-domain setting.