For immersive applications, generating binaural sound that matches its visual counterpart is crucial to bringing meaningful experiences to people in a virtual environment. Recent works have shown that neural networks can synthesize binaural audio from mono audio using 2D visual information as guidance. Extending this approach by guiding the audio with 3D visual information and operating in the waveform domain may allow for a more accurate auralization of a virtual audio scene. In this paper, we present Points2Sound, a multi-modal deep learning model which generates a binaural version of mono audio using 3D point cloud scenes. Specifically, Points2Sound consists of a vision network with 3D sparse convolutions, which extracts visual features from the point cloud scene to condition an audio network operating in the waveform domain, which in turn synthesizes the binaural version. Experimental results indicate that 3D visual information can successfully guide multi-modal deep learning models for the task of binaural synthesis. In addition, we investigate different loss functions and 3D point cloud attributes, showing that directly predicting the full binaural signal and using RGB-depth features increases the performance of our proposed model.
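To make the described conditioning scheme concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It assumes a dense 3D CNN as a stand-in for the sparse-convolution vision network used in the paper, a voxelized point cloud with RGB-depth attributes as input, and FiLM-style scale-and-shift conditioning of a 1-D convolutional waveform network; the abstract only states that the visual features condition the audio network, so these specific choices are illustrative assumptions.

```python
# Hedged sketch of a vision-conditioned binaural synthesis model (illustrative only).
import torch
import torch.nn as nn

class VisionEncoder(nn.Module):
    """Dense 3D CNN stand-in for the paper's sparse-convolution vision network."""
    def __init__(self, in_channels=4, feat_dim=512):   # e.g. RGB + depth attributes (assumed)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, vox):                              # vox: (B, C, D, H, W) voxel grid
        return self.proj(self.net(vox).flatten(1))       # visual feature vector (B, feat_dim)

class ConditionedAudioNet(nn.Module):
    """Waveform-domain 1-D conv network conditioned on the visual feature vector."""
    def __init__(self, feat_dim=512, hidden=64):
        super().__init__()
        self.inp = nn.Conv1d(1, hidden, 15, padding=7)
        self.film = nn.Linear(feat_dim, 2 * hidden)       # FiLM-style scale and shift (assumed)
        self.body = nn.Sequential(
            nn.Conv1d(hidden, hidden, 15, padding=7), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 15, padding=7), nn.ReLU(),
        )
        self.out = nn.Conv1d(hidden, 2, 1)                # two output channels: left / right

    def forward(self, mono, vis_feat):                    # mono: (B, 1, T) waveform
        h = self.inp(mono)
        scale, shift = self.film(vis_feat).chunk(2, dim=-1)
        h = h * scale.unsqueeze(-1) + shift.unsqueeze(-1) # condition on the 3D visual features
        return self.out(self.body(h))                     # binaural waveform (B, 2, T)

vision, audio = VisionEncoder(), ConditionedAudioNet()
binaural = audio(torch.randn(2, 1, 16000), vision(torch.randn(2, 4, 32, 32, 32)))
print(binaural.shape)  # torch.Size([2, 2, 16000])
```

In this sketch the network predicts the full two-channel binaural signal directly, mirroring the loss-function finding reported in the abstract; a sparse-convolution library would replace the dense Conv3d layers in a faithful reproduction of the vision branch.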