Estimating the quality of transmitted speech is known to be a non-trivial task. Traditionally, test participants are asked to rate the quality of speech samples; nowadays, automated methods are also available. These methods can be divided into 1) intrusive models, which use both the original and the degraded signal, and 2) non-intrusive models, which require only the degraded signal. Recently, non-intrusive models based on neural networks have been shown to outperform signal-processing-based models. However, the advantages of deep-learning-based models come at the cost of being more challenging to interpret. To gain more insight into these prediction models, the non-intrusive speech quality prediction model NISQA is analyzed in this paper. NISQA is composed of a convolutional neural network (CNN) and a recurrent neural network (RNN). The task of the CNN is to compute features relevant to speech quality prediction at the frame level, while the RNN models time dependencies between the individual speech frames. Different explanation algorithms are used to understand the automatically learned features of the CNN. In this way, several interpretable features could be identified, such as sensitivity to noise or to strong interruptions. On the other hand, it was found that multiple features carry redundant information.
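The two-stage structure described in the abstract (frame-level features from a CNN, followed by an RNN over the frame sequence) can be illustrated with a minimal, dependency-free sketch. This is not the NISQA implementation: the weights are random rather than trained, and the kernel size, hidden size, max and average pooling, the 1-to-5 output range, and all function names (`conv1d_valid`, `frame_features`, `rnn_over_frames`, `predict_quality`, `random_params`) are assumptions chosen only for illustration.

```python
import math
import random


def conv1d_valid(x, kernel):
    """'Valid' 1-D cross-correlation of a list with a kernel (no padding)."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]


def frame_features(frame, kernels):
    """Toy CNN stage: one feature per kernel, computed for a single frame.

    Each kernel is slid along the frequency axis, passed through a ReLU,
    and max-pooled to a single number (an assumed, simplified design).
    """
    return [max(max(0.0, v) for v in conv1d_valid(frame, kern))
            for kern in kernels]


def rnn_over_frames(features, w_in, w_rec):
    """Toy RNN stage: h_t = tanh(W_in f_t + W_rec h_{t-1}), one state per frame."""
    hidden = [0.0] * len(w_rec)
    states = []
    for f in features:
        hidden = [
            math.tanh(sum(w * x for w, x in zip(w_in[i], f))
                      + sum(w * h for w, h in zip(w_rec[i], hidden)))
            for i in range(len(hidden))
        ]
        states.append(hidden)
    return states


def predict_quality(spectrogram, params):
    """Map a spectrogram (list of frames, each a list of band energies) to a score."""
    feats = [frame_features(frame, params["kernels"]) for frame in spectrogram]
    states = rnn_over_frames(feats, params["w_in"], params["w_rec"])
    # Average the hidden states over time, then apply a linear read-out.
    pooled = [sum(s[i] for s in states) / len(states)
              for i in range(len(states[0]))]
    raw = sum(w * p for w, p in zip(params["w_out"], pooled))
    # Squash into an assumed 1-to-5 rating range.
    return 1.0 + 4.0 / (1.0 + math.exp(-raw))


def random_params(n_kernels=3, kernel_size=3, hidden_size=4, seed=0):
    """Untrained, random weights; a real model would learn these from ratings."""
    rng = random.Random(seed)

    def rand_vec(n):
        return [rng.uniform(-1.0, 1.0) for _ in range(n)]

    return {
        "kernels": [rand_vec(kernel_size) for _ in range(n_kernels)],
        "w_in": [rand_vec(n_kernels) for _ in range(hidden_size)],
        "w_rec": [rand_vec(hidden_size) for _ in range(hidden_size)],
        "w_out": rand_vec(hidden_size),
    }


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical input: 6 frames with 8 frequency bands each.
    spectrogram = [[rng.random() for _ in range(8)] for _ in range(6)]
    print(predict_quality(spectrogram, random_params()))
```

In this sketch the output of `frame_features` plays the role of the per-frame CNN features that the paper sets out to interpret, while `rnn_over_frames` only combines them over time; only the degraded signal is needed as input, which is what makes the approach non-intrusive.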