Subjective evaluations are critical for assessing the perceptual realism of sounds in audio-synthesis-driven technologies such as augmented and virtual reality. However, they are challenging to set up, fatiguing for participants, and expensive. In this work, we tackle the problem of capturing the perceptual characteristics of sound localization. Specifically, we propose a framework for building a general-purpose quality metric that assesses spatial localization differences between two binaural recordings. We model localization similarity using activation-level distances from deep networks trained for direction-of-arrival (DOA) estimation. Our proposed metric (DPLM) outperforms baseline metrics in correlation with subjective ratings across a diverse set of datasets, even without the benefit of any human-labeled training data.
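The idea of an activation-level distance can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the network is a small stack of randomly initialized dense layers standing in for a trained DOA estimator, the inputs are random vectors standing in for binaural features, and the distance (mean absolute activation difference, summed over layers) is one plausible form rather than the definition used by DPLM. The names `make_toy_network`, `layer_activations`, and `activation_distance` are hypothetical.

```python
import numpy as np


def make_toy_network(layer_sizes, seed=0):
    """Random dense layers standing in for a trained DOA estimator (hypothetical)."""
    rng = np.random.default_rng(seed)
    return [
        rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]


def layer_activations(weights, features):
    """Return the ReLU activations of every layer for one input feature vector."""
    activations = []
    hidden = features
    for w in weights:
        hidden = np.maximum(hidden @ w, 0.0)
        activations.append(hidden)
    return activations


def activation_distance(weights, reference, test):
    """Sum over layers of the mean absolute activation difference (assumed form)."""
    ref_acts = layer_activations(weights, reference)
    test_acts = layer_activations(weights, test)
    return float(sum(np.mean(np.abs(r - t)) for r, t in zip(ref_acts, test_acts)))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    net = make_toy_network([64, 32, 16])
    reference = rng.standard_normal(64)  # placeholder for binaural features
    perturbed = reference + 0.01 * rng.standard_normal(64)
    # Identical inputs give a distance of exactly 0.0.
    print(activation_distance(net, reference, reference))
    # A perturbed input gives a non-negative distance.
    print(activation_distance(net, reference, perturbed))
```

Because the distance is computed on a network's internal activations rather than on human ratings, a metric of this kind needs no human-labeled data beyond whatever was used to train the underlying network, which is consistent with the claim above.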