Human perception of the complex world relies on a comprehensive analysis of multi-modal signals, and the co-occurrence of audio and video signals provides humans with rich cues. This paper focuses on novel audio-visual scene synthesis in the real world. Given a video recording of an audio-visual scene, the task is to synthesize new videos with spatial audio along arbitrary novel camera trajectories in that scene. Directly using a NeRF-based model for audio synthesis is insufficient because it lacks prior knowledge and acoustic supervision. To tackle these challenges, we first propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF, associating audio generation with the 3D geometry of the visual environment. In addition, we propose a coordinate transformation module that expresses the viewing direction relative to the sound source. Such a direction transformation helps the model learn sound source-centric acoustic fields. Moreover, we utilize a head-related impulse response function to synthesize pseudo binaural audio for data augmentation that strengthens training. We qualitatively and quantitatively demonstrate the advantage of our model on real-world audio-visual scenes. We refer interested readers to view our video results for convincing comparisons.
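To make the two audio-related ideas in the abstract concrete, the following minimal Python sketch illustrates (a) re-expressing a listener/viewing direction in a sound source-centric coordinate frame and (b) convolving mono audio with a head-related impulse response (HRIR) pair to obtain pseudo binaural audio for augmentation. The function names and parameterization are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def source_centric_direction(listener_pos, source_pos):
    """Hypothetical helper: express the listening direction relative to the
    sound source as (distance, azimuth, elevation); the paper's exact
    parameterization may differ."""
    rel = np.asarray(listener_pos, dtype=float) - np.asarray(source_pos, dtype=float)
    dist = np.linalg.norm(rel)
    unit = rel / (dist + 1e-8)
    azimuth = np.arctan2(unit[1], unit[0])
    elevation = np.arcsin(np.clip(unit[2], -1.0, 1.0))
    return dist, azimuth, elevation

def pseudo_binaural(mono, hrir_left, hrir_right):
    """Augmentation sketch: convolve a mono waveform with left/right HRIRs
    to synthesize a pseudo binaural (2-channel) signal."""
    left = fftconvolve(mono, hrir_left, mode="full")[: len(mono)]
    right = fftconvolve(mono, hrir_right, mode="full")[: len(mono)]
    return np.stack([left, right], axis=0)  # shape: (2, num_samples)
```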