Neural audio codecs have recently gained popularity in generative modeling because they offer high-fidelity audio reconstruction at low bitrates. While human listening studies remain the gold standard for assessing perceptual quality, they are time-consuming and impractical. In this work, we examine how reliably existing objective quality metrics assess the performance of recent neural audio codecs. To this end, we conduct a MUSHRA listening test on high-fidelity speech signals and analyze the correlation between the subjective scores and widely used objective metrics. Our results show that, while some metrics align well with human perception, others struggle to capture relevant distortions. Our findings provide practical guidance for selecting appropriate evaluation metrics when using neural audio codecs for speech.
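As an illustration of the correlation analysis described above, the following is a minimal sketch of how per-condition MUSHRA scores could be compared against one objective metric. The abstract does not state which correlation coefficients are used; Pearson and Spearman are assumed here, and the function names and example values are hypothetical placeholders, not data from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical placeholder values, one entry per codec condition.
    mushra_scores = [92.0, 71.5, 55.0, 38.0, 80.0]
    metric_scores = [4.4, 3.6, 3.1, 2.2, 3.5]
    print("Pearson: ", pearson(mushra_scores, metric_scores))
    print("Spearman:", spearman(mushra_scores, metric_scores))
```

Under these assumptions, a metric that tracks listener judgments would yield coefficients close to 1, while a metric that misses perceptually relevant distortions would yield values nearer 0. Spearman is included because it only requires a monotonic relationship between the metric and the subjective scores, not a linear one.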