Most objective speech quality assessment tools (e.g., perceptual evaluation of speech quality (PESQ)) work by comparing the degraded or processed speech with its clean counterpart. The need for such a "golden" reference considerably restricts the practicality of these tools in real-world scenarios, where the clean reference usually cannot be accessed. Human listeners, on the other hand, can readily evaluate speech quality without any reference (e.g., in mean opinion score (MOS) tests), which implies that an objective, non-intrusive (no clean reference needed) quality assessment mechanism exists. In this study, we propose a novel end-to-end, non-intrusive speech quality evaluation model, termed Quality-Net, based on bidirectional long short-term memory. Quality-Net derives its utterance-level quality estimate from frame-level assessments. Frame constraints and sensible initializations of the forget gate biases are applied so that meaningful frame-level quality assessments can be learned from the utterance-level quality label alone. Experimental results show that Quality-Net yields a high correlation with PESQ (0.9 for noisy speech and 0.84 for speech processed by speech enhancement). We believe that Quality-Net has the potential to be used in a wide variety of speech signal processing applications.
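As a rough illustration of how frame-level assessments can be tied to an utterance-level label, the following minimal sketch assumes that the utterance score is the mean of the frame scores and that the frame constraint is a mean squared deviation of each frame score from the utterance label. The abstract specifies neither choice; the function names and the `frame_weight` parameter are hypothetical, and the bidirectional LSTM that would produce the frame scores is omitted.

```python
from statistics import fmean


def utterance_score(frame_scores):
    """Pool frame-level scores into one utterance-level score (assumed: mean)."""
    return fmean(frame_scores)


def quality_loss(frame_scores, label, frame_weight=1.0):
    """Utterance-level squared error plus an assumed frame constraint.

    The frame term pulls every frame score toward the utterance-level label.
    `frame_weight` is a hypothetical knob, not a value from the abstract.
    """
    utt_err = (label - utterance_score(frame_scores)) ** 2
    frame_err = fmean([(label - q) ** 2 for q in frame_scores])
    return utt_err + frame_weight * frame_err


if __name__ == "__main__":
    scores = [2.0, 3.0, 4.0]  # made-up frame-level predictions
    print(utterance_score(scores))
    print(quality_loss(scores, label=3.0))
```

In this sketch the utterance-level term alone would be satisfied by any frame scores whose mean matches the label; the frame term is what discourages arbitrary per-frame values.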