Audio quality assessment is critical for evaluating the perceptual realism of sounds. However, the time and expense of obtaining "gold standard" human judgments limit the availability of such data. For AR and VR, good perceived sound quality and localizability of sources are among the key elements for ensuring complete immersion of the user. Our work introduces SAQAM, which uses a multi-task learning framework to assess listening quality (LQ) and spatialization quality (SQ) between any given pair of binaural signals without using any subjective data. We model LQ by training on a simulated dataset of triplet human judgments, and SQ by utilizing activation-level distances from networks trained for direction of arrival (DOA) estimation. We show that SAQAM correlates well with human responses across four diverse datasets. Since it is a deep network, the metric is differentiable, making it suitable as a loss function for other tasks. For example, simply replacing an existing loss with our metric yields an improvement in a speech-enhancement network.
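To make the idea of an activation-level distance concrete, the following is a minimal sketch and not the SAQAM implementation. It assumes a toy stack of randomly initialized dense layers in place of a trained DOA network, plain feature vectors in place of binaural signals, and a per-layer mean absolute difference as the distance; the abstract specifies none of these details, and the names `make_toy_network`, `activations`, and `activation_distance` are hypothetical.

```python
import math
import random


def make_toy_network(layer_sizes, seed=0):
    """Build random dense-layer weights (assumed stand-in for a trained DOA network)."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]


def activations(network, x):
    """Return the ReLU activations of every layer for the input vector x."""
    outputs = []
    for weights in network:
        x = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]
        outputs.append(x)
    return outputs


def activation_distance(network, reference, test):
    """Sum over layers of the mean absolute difference between activations."""
    total = 0.0
    for a_ref, a_test in zip(activations(network, reference),
                             activations(network, test)):
        total += sum(abs(r - t) for r, t in zip(a_ref, a_test)) / len(a_ref)
    return total


# Usage: compare a reference feature vector with a perturbed copy.
net = make_toy_network([8, 16, 4])
rng = random.Random(1)
reference = [rng.uniform(-1.0, 1.0) for _ in range(8)]
degraded = [v + rng.gauss(0.0, 0.5) for v in reference]
print(activation_distance(net, reference, reference))  # identical inputs give 0.0
print(activation_distance(net, reference, degraded))
```

Because every step is a composition of sums, ReLUs, and absolute values, the same computation written in an automatic-differentiation framework would be differentiable almost everywhere, which is the property that lets a metric of this kind serve as a training loss.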