Speech quality assessment is a critical component of many voice communication applications, such as telephony and online conferencing. Traditional intrusive speech quality assessment requires the clean reference of the degraded utterance to provide an accurate quality measurement, which limits the usability of these methods in real-world scenarios. On the other hand, non-intrusive subjective measurement is the "gold standard" for evaluating speech quality, as human listeners can intrinsically evaluate the quality of any degraded speech with ease. In this paper, we propose a novel end-to-end model structure called the Convolutional Context-Aware Transformer (CCAT) network to predict the mean opinion score (MOS) given by human raters. We evaluate our model on three MOS-annotated datasets spanning multiple languages and distortion types, and submit our results to the ConferencingSpeech 2022 Challenge. Our experiments show that CCAT provides promising MOS predictions compared to current state-of-the-art non-intrusive speech assessment models: on the challenge evaluation test set, the average Pearson correlation coefficient (PCC) increases from 0.530 to 0.697 and the average root mean square error (RMSE) decreases from 0.768 to 0.570 relative to the baseline model.
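For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how PCC and RMSE between predicted and human-rated MOS values can be computed. The function names `pearson_cc` and `rmse` and the toy scores are hypothetical illustrations, not taken from the paper or the challenge tooling, and the sketch assumes both metrics are computed directly over paired per-utterance scores, which may differ from the challenge's exact evaluation protocol.

```python
import math


def pearson_cc(predicted, reference):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(predicted)
    if n != len(reference) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    # Sums of centered cross-products and squared deviations.
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / math.sqrt(var_p * var_r)


def rmse(predicted, reference):
    """Root mean square error between two equal-length score lists."""
    n = len(predicted)
    if n != len(reference) or n == 0:
        raise ValueError("need two non-empty, equal-length lists")
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n)


if __name__ == "__main__":
    # Toy MOS values on the usual 1-5 scale (illustrative only).
    human_mos = [1.8, 2.5, 3.1, 4.0, 4.6]
    model_mos = [2.0, 2.4, 3.5, 3.8, 4.3]
    print("PCC:", pearson_cc(model_mos, human_mos))
    print("RMSE:", rmse(model_mos, human_mos))
```

A higher PCC means the predictions track the ranking and spread of the human ratings more closely, while a lower RMSE means the predicted scores sit closer to the human scores in absolute terms, which is why the abstract reports an increase in the former and a decrease in the latter.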