Standard evaluation metrics such as the Inception Score and Fréchet Audio Distance provide a general audio-quality distance between synthesized audio and clean reference audio. However, the sensitivity of these metrics to variations in the statistical parameters that define an audio texture is not well studied. In this work, we systematically study the sensitivity of several existing audio quality evaluation metrics to parameter variations in audio textures. We also study three additional, potentially parameter-sensitive metrics for audio texture synthesis: (a) a Gram-matrix-based distance, (b) an Accumulated Gram metric that uses a summarized version of the Gram matrices, and (c) a metric based on cochlear-model statistical features. These metrics use deep features that summarize the statistics of a given audio texture, and are thus inherently sensitive to variations in the statistical parameters that define it. We evaluate the sensitivity of the existing standard metrics, as well as the Gram-matrix-based and cochlear-model-based metrics, to control-parameter variations across a wide range of texture and parameter types, and validate the results with a subjective evaluation. We find that each metric is sensitive to a different set of texture-parameter types. This is a first step towards investigating objective metrics for assessing parameter sensitivity in audio textures.
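To make the idea of a Gram-matrix-based distance concrete, the following is a minimal sketch in dependency-free Python. It assumes the feature maps are already available as channels-by-frames arrays; the function names `gram_matrix` and `gram_distance`, the normalization by the number of frames, and the mean-squared-difference comparison are illustrative assumptions, not the definitions used in this work.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def gram_matrix(features: Sequence[Sequence[float]]) -> Matrix:
    """Time-averaged Gram matrix of a (channels x frames) feature map.

    Entry (i, j) is the inner product of channels i and j divided by the
    number of frames (an assumed normalization).
    """
    if not features or not features[0]:
        raise ValueError("feature map must be non-empty")
    n_frames = len(features[0])
    if any(len(row) != n_frames for row in features):
        raise ValueError("all channels must have the same number of frames")
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / n_frames for row_j in features]
        for row_i in features
    ]


def gram_distance(
    feat_ref: Sequence[Sequence[float]],
    feat_syn: Sequence[Sequence[float]],
) -> float:
    """Mean squared difference between two Gram matrices (assumed distance)."""
    g_ref = gram_matrix(feat_ref)
    g_syn = gram_matrix(feat_syn)
    if len(g_ref) != len(g_syn):
        raise ValueError("feature maps must have the same number of channels")
    n = len(g_ref)
    total = sum(
        (g_ref[i][j] - g_syn[i][j]) ** 2 for i in range(n) for j in range(n)
    )
    return total / (n * n)


# Toy feature maps: 2 channels, 4 frames each (made-up values).
reference = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
synthesized = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
print(gram_distance(reference, synthesized))  # 0.1875
```

Because the Gram matrix averages over frames, it discards temporal order and keeps only channel co-activation statistics. In this sketch, shifting the frames of `reference` by one position leaves the distance to the original at zero, while changing how often the first channel is active does not.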