We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the \emph{sound} of that scene from an unseen target viewpoint? We propose a neural rendering approach: the Visually-Guided Acoustic Synthesis (ViGAS) network, which learns to synthesize the sound of an arbitrary point in space by analyzing the input audio-visual cues. To benchmark this task, we collect two first-of-their-kind large-scale multi-view audio-visual datasets, one synthetic and one real. We show that our model successfully reasons about the spatial cues and synthesizes faithful audio on both datasets. To our knowledge, this work represents the first formulation, dataset, and approach to solve the novel-view acoustic synthesis task, which has exciting potential applications ranging from AR/VR to art and design. Unlocked by this work, we believe that the future of novel-view synthesis lies in multi-modal learning from videos.