Human judgments, collected as Mean Opinion Scores (MOS), are the most reliable way to assess the quality of speech signals. However, several recent attempts to estimate MOS automatically with deep learning lack robustness and generalization, which limits their use in real-world applications. In this work, we present NORESQA-MOS, a novel framework for estimating the MOS of a speech signal. Unlike prior work, our approach uses non-matching references as a form of conditioning to ground the neural network's MOS estimate. We show that NORESQA-MOS generalizes better and estimates MOS more robustly than previous state-of-the-art methods such as DNSMOS and NISQA, even though it is trained on a smaller training set. We also show that our generic framework can be combined with other learning methods, such as self-supervised learning, and can add to the benefits of these methods.