Perceptual speech quality is an important performance metric for teleconferencing applications. The mean opinion score (MOS) is standardized for the perceptual evaluation of speech quality and is obtained by asking listeners to rate the quality of a speech sample. Recently, there has been increasing research interest in developing models for estimating MOS blindly. Here we propose a multi-task framework to include additional labels and data in training to improve the performance of a blind MOS estimation model. Experimental results indicate that the proposed model can be trained to jointly estimate MOS, reverberation time (T60), and clarity (C50) by combining two disjoint data sets in training, one containing only MOS labels and the other containing only T60 and C50 labels. Furthermore, we use a semi-supervised framework to combine two MOS data sets in training, one containing only MOS labels (per ITU-T Recommendation P.808), and the other containing separate scores for speech signal, background noise, and overall quality (per ITU-T Recommendation P.835). Finally, we present preliminary results for addressing individual rater bias in the MOS labels.
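One way to picture training on two disjoint data sets is a loss that skips missing labels. The sketch below is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss, so the mean squared error, the mask convention, and the function name `masked_multitask_mse` are all hypothetical. Each row carries either a MOS label or T60 and C50 labels, and a 0/1 mask marks which entries exist.

```python
import numpy as np

def masked_multitask_mse(pred, target, mask):
    """Per-task mean squared error computed only over labelled entries.

    pred, target, mask: arrays of shape (batch, tasks). mask is 1 where a
    label exists and 0 where it is missing (hypothetical convention).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mask = np.asarray(mask, dtype=float)
    # Replace missing targets (NaN here) so they cannot contaminate the sum.
    safe_target = np.where(mask > 0, target, 0.0)
    sq_err = mask * (pred - safe_target) ** 2
    # Per-task label counts; guard against a task with no labels in the batch.
    counts = np.maximum(mask.sum(axis=0), 1.0)
    per_task = sq_err.sum(axis=0) / counts
    return per_task.sum(), per_task

# Columns: MOS, T60 (s), C50 (dB). Row 0 comes from a MOS-only set,
# row 1 from a set labelled only with T60 and C50. Values are made up.
pred = [[3.0, 0.5, 10.0], [4.0, 0.7, 12.0]]
target = [[4.0, np.nan, np.nan], [np.nan, 0.5, 10.0]]
mask = [[1, 0, 0], [0, 1, 1]]
total, per_task = masked_multitask_mse(pred, target, mask)
```

Because MOS, T60, and C50 live on different scales, a practical setup would probably normalize the targets or weight the per-task terms before summing; the unweighted sum here is only for brevity.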