Many audio processing tasks require perceptual assessment. However, the time and expense of obtaining ``gold standard'' human judgments limit the availability of such data. Most applications incorporate full-reference or other similarity-based metrics (e.g., PESQ) that depend on a clean reference. Researchers have relied on such metrics to evaluate and compare proposed methods, often concluding that small measured differences imply that one method is more effective than another. This paper demonstrates several practical scenarios in which similarity metrics fail to agree with human perception because they (1) vary with the choice of clean reference, (2) rely on attributes that humans factor out when judging quality, and (3) are sensitive to imperceptible signal level differences. In those scenarios, we show that no-reference metrics do not suffer from these shortcomings and correlate better with human perception. We therefore conclude that similarity serves as an unreliable proxy for audio quality.