Significant research effort is currently dedicated to non-intrusive quality and intelligibility assessment, especially given that it enables the curation of large-scale datasets of in-the-wild speech. However, as generative models become increasingly capable of synthesizing high-quality speech, new types of artifacts become relevant, such as generative hallucinations. While intrusive metrics can spot such discrepancies against a reference signal, it is not clear how current non-intrusive methods react to high-quality phoneme confusions or, more extremely, gibberish speech. In this paper we explore how to factor in this aspect in a fully unsupervised setting by leveraging language models. Additionally, we publish a dataset of high-quality synthesized gibberish speech to support the development of measures for assessing implausible sentences in spoken language, alongside code for computing scores from a variety of speech language models.
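The core idea of scoring spoken-language plausibility with a language model can be illustrated with a minimal sketch. Here a Laplace-smoothed unigram model over a toy corpus stands in for a real speech language model; the function name `lm_score` and the toy data are illustrative assumptions, not the paper's implementation. The point is only that an implausible (gibberish) token sequence receives a lower length-normalized log-likelihood than a plausible one:

```python
import math
from collections import Counter

def lm_score(sentence, counts, total, vocab_size):
    # Length-normalized log-likelihood under a Laplace-smoothed
    # unigram model -- a hypothetical stand-in for scoring with a
    # real (speech) language model.
    words = sentence.split()
    logp = sum(
        math.log((counts.get(w, 0) + 1) / (total + vocab_size))
        for w in words
    )
    return logp / max(1, len(words))

# Toy corpus standing in for LM training data (illustrative only).
corpus = "the cat sat on the mat the dog sat on the rug".split()
counts = Counter(corpus)
total = len(corpus)
vocab_size = len(counts)

plausible = lm_score("the cat sat on the mat", counts, total, vocab_size)
gibberish = lm_score("mat blorp zib the frunk", counts, total, vocab_size)
assert plausible > gibberish  # implausible text scores lower
```

In practice such scores would come from a large pretrained (speech) language model rather than a unigram model, but the ranking principle is the same.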