Binaural sound that matches its visual counterpart is crucial to bringing meaningful and immersive experiences to people in augmented reality (AR) and virtual reality (VR) applications. Recent works have shown that binaural audio can be generated from mono audio using 2D visual information as guidance. Using 3D visual information may allow for a more accurate representation of a virtual audio scene in VR/AR applications. This paper proposes Points2Sound, a multi-modal deep learning model which generates a binaural version of mono audio using 3D point cloud scenes. Specifically, Points2Sound consists of a vision network, which extracts visual features from the point cloud scene, and an audio network operating in the waveform domain, which is conditioned on those features to synthesize the binaural version. Both quantitative and perceptual evaluations indicate that our proposed model is preferred over a reference case based on a recent 2D mono-to-binaural model.
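The following is a minimal illustrative sketch, not the authors' implementation, of the general idea described above: a point-cloud encoder produces a global visual feature that conditions a waveform-domain network mapping mono audio to a two-channel (binaural) output. All module names, layer sizes, and the FiLM-style conditioning are assumptions made for illustration only.

```python
# Hypothetical sketch (assumed architecture, not Points2Sound itself):
# a point-cloud encoder conditions a waveform network to predict binaural audio.
import torch
import torch.nn as nn


class PointCloudEncoder(nn.Module):
    """Toy PointNet-style encoder: shared per-point MLP followed by max pooling."""

    def __init__(self, in_dim=6, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, points):               # points: (B, N, 6) -> xyz + rgb per point
        per_point = self.mlp(points)          # (B, N, feat_dim)
        return per_point.max(dim=1).values    # (B, feat_dim) global scene feature


class ConditionedAudioNet(nn.Module):
    """Toy waveform encoder/decoder; the visual feature modulates the bottleneck (FiLM-style)."""

    def __init__(self, feat_dim=128, channels=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=15, stride=4, padding=7), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=15, stride=4, padding=7), nn.ReLU(),
        )
        self.film = nn.Linear(feat_dim, 2 * channels)      # per-channel scale and shift
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(channels, channels, kernel_size=16, stride=4, padding=6), nn.ReLU(),
            nn.ConvTranspose1d(channels, 2, kernel_size=16, stride=4, padding=6),  # 2 = left/right
        )

    def forward(self, mono, visual_feat):                  # mono: (B, 1, T)
        h = self.encoder(mono)                              # (B, C, T')
        scale, shift = self.film(visual_feat).chunk(2, dim=-1)
        h = h * scale.unsqueeze(-1) + shift.unsqueeze(-1)   # condition on the 3D scene
        return self.decoder(h)                               # (B, 2, T) binaural estimate


# Usage example with random data (shapes chosen so the strided layers line up).
points = torch.randn(2, 2048, 6)       # two scenes, 2048 points each
mono = torch.randn(2, 1, 16384)        # mono waveform chunks
binaural = ConditionedAudioNet()(mono, PointCloudEncoder()(points))
print(binaural.shape)                   # torch.Size([2, 2, 16384])
```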