Recent years have seen considerable advances in audio synthesis with deep generative models. However, the state of the art is very difficult to quantify: different studies often use different evaluation methodologies and different metrics when reporting results, making a direct comparison to other systems difficult if not impossible. Furthermore, the perceptual relevance and meaning of the reported metrics are in most cases unknown, which precludes conclusive insights into practical usability and audio quality. This paper presents a study that investigates state-of-the-art approaches side by side using (i) a set of previously proposed objective metrics for audio reconstruction and (ii) a listening study. The results indicate that currently used objective metrics are insufficient to describe the perceptual quality of current systems.