Non-intrusive speech quality assessment is a crucial operation in multimedia applications. The scarcity of annotated data and the lack of a reference signal are among the main challenges in designing efficient quality assessment metrics. In this paper, we propose two multi-task models to tackle these problems. In the first model, we first learn a feature representation with a degradation classifier on a large dataset, and then perform mean opinion score (MOS) prediction and degradation classification simultaneously on a small dataset annotated with MOS. In the second model, we first learn a deep clustering-based unsupervised feature representation on the large dataset, and then perform MOS prediction and cluster label classification simultaneously on a small dataset. The results show that the deep clustering-based model outperforms the degradation classifier-based model and the three baselines (autoencoder features, P.563, and SRMRnorm) on TCD-VoIP. These results indicate that multi-task learning combined with feature representations learned from unlabelled data is a promising approach to deal with the lack of large MOS-annotated datasets.
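To illustrate what performing MOS prediction and classification simultaneously can look like, the following is a minimal sketch of a joint objective for one batch. The abstract does not specify the loss, so the choices here are assumptions: mean squared error for the MOS head, cross-entropy for the auxiliary head (degradation or cluster labels), and a hypothetical weighting parameter `aux_weight`. The function names are likewise illustrative and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multitask_loss(mos_pred, mos_true, class_logits, class_true, aux_weight=0.5):
    """Joint loss for one batch (assumed form, not the paper's exact objective).

    mos_pred, mos_true: predicted and annotated MOS values, one per utterance.
    class_logits: auxiliary-head logits, one list per utterance.
    class_true: integer label per utterance (degradation type or cluster id).
    aux_weight: hypothetical weight on the auxiliary classification term.
    Returns (total, mse, cross_entropy).
    """
    n = len(mos_true)
    # Regression term: mean squared error on MOS.
    mse = sum((p - t) ** 2 for p, t in zip(mos_pred, mos_true)) / n
    # Auxiliary term: mean cross-entropy on the class labels.
    ce = -sum(
        math.log(softmax(logits)[label])
        for logits, label in zip(class_logits, class_true)
    ) / n
    return mse + aux_weight * ce, mse, ce


# Toy batch of two utterances with two auxiliary classes.
total, mse, ce = multitask_loss(
    mos_pred=[3.0, 4.0],
    mos_true=[3.5, 4.0],
    class_logits=[[0.0, 0.0], [0.0, 0.0]],
    class_true=[0, 1],
)
print(total, mse, ce)
```

In a full model, both heads would share the feature representation learned on the large dataset, so gradients from the auxiliary term would also shape the features used for MOS prediction. How the two terms are balanced is a design choice left open here.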