The quality of speech communication systems that include noise suppression algorithms is typically evaluated in laboratory experiments according to ITU-T Rec. P.835, in which participants rate the background noise, the speech signal, and the overall quality on separate scales. This paper introduces an open-source toolkit for conducting subjective quality evaluation of noise-suppressed speech in crowdsourcing. We follow ITU-T Recs. P.835 and P.808 and highly automate the process to prevent moderator errors. To assess the validity of our evaluation method, we compared Mean Opinion Scores (MOS) calculated from ratings collected with our implementation against MOS values from a standard laboratory experiment conducted according to ITU-T Rec. P.835. The results show high validity on all three scales, namely background noise, speech signal, and overall quality (average PCC = 0.961). The results of a round-robin test (N=5) show that our implementation is also a highly reproducible evaluation method (PCC = 0.99). Finally, we used our implementation as the primary evaluation metric in the INTERSPEECH 2021 Deep Noise Suppression Challenge, which demonstrates that it is practical to use at scale. The challenge results are analyzed to examine why the model with the best overall quality also performed best in terms of background noise and speech signal quality.
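The validity check described above reduces to two simple computations: per-condition MOS values averaged from raw opinion scores on each P.835 scale, and the Pearson correlation coefficient (PCC) between the laboratory and crowdsourcing MOS lists. The sketch below is a minimal illustration of that comparison, not the toolkit's own code; the rating data are hypothetical.

```python
from statistics import mean

def mos(ratings_per_condition):
    """Average the 1-5 opinion scores of each condition into a MOS."""
    return [mean(scores) for scores in ratings_per_condition]

def pcc(x, y):
    """Pearson correlation coefficient between two equal-length MOS lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical ratings for three conditions on one scale (e.g. overall quality),
# once from a lab panel and once from crowdworkers.
lab_mos = mos([[4, 5, 4], [3, 3, 2], [2, 1, 2]])
crowd_mos = mos([[5, 4, 5], [3, 2, 3], [1, 2, 2]])
print(round(pcc(lab_mos, crowd_mos), 3))
```

In the paper's setup the same computation is run per scale (background noise, speech signal, overall), yielding the reported average PCC across the three scales.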