Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos

Modeling sounds emitted from physical object interactions is critical for immersive perceptual experiences in real and virtual worlds. Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that can represent and synthesize the sound. However, they require fine details of both the object geometries and impact locations, which are rarely available in the real world, so they cannot be applied to synthesize impact sounds from common videos. On the other hand, existing video-driven deep learning-based approaches can only capture the weak correspondence between visual content and impact sounds because they lack physics knowledge. In this work, we propose a physics-driven diffusion model that can synthesize high-fidelity impact sound for a silent video clip. In addition to the video content, we propose to use physics priors to guide the impact sound synthesis procedure. The physics priors include two components: physics parameters that are estimated directly from noisy real-world impact sound examples without a sophisticated setup, and residual parameters, learned by neural networks, that account for the sound environment. We further implement a novel diffusion model with specific training and inference strategies to combine physics priors and visual information for impact sound synthesis. Experimental results show that our model outperforms several existing systems in generating realistic impact sounds. More importantly, the physics-based representations are fully interpretable and transparent, thus enabling us to perform sound editing flexibly.
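
To make the idea of "physics parameters that can represent and synthesize the sound" concrete, here is a minimal sketch. The abstract does not say how the parameters are defined, so this example assumes a standard modal-synthesis form, in which an impact sound is a sum of exponentially decaying sinusoids, each described by a frequency, an amplitude, and a decay rate. The function name `synthesize_impact` and all parameter values are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np


def synthesize_impact(freqs_hz, amps, decays, sample_rate=16000, duration_s=0.5):
    """Sum of damped sinusoids (assumed modal form, not the paper's exact model).

    freqs_hz: mode frequencies in Hz
    amps:     initial amplitude of each mode
    decays:   exponential decay rate of each mode, in 1/s
    """
    # Time axis in seconds, shape (num_samples,)
    t = np.arange(int(sample_rate * duration_s)) / sample_rate
    # Reshape parameters to (num_modes, 1) so they broadcast against t
    freqs_hz = np.asarray(freqs_hz, dtype=float)[:, None]
    amps = np.asarray(amps, dtype=float)[:, None]
    decays = np.asarray(decays, dtype=float)[:, None]
    # Each row is one decaying mode; the sound is the sum over modes
    modes = amps * np.exp(-decays * t) * np.sin(2.0 * np.pi * freqs_hz * t)
    return modes.sum(axis=0)


# Hypothetical parameters for a two-mode impact
wave = synthesize_impact([440.0, 1200.0], [1.0, 0.5], [20.0, 40.0])

# "Editing" the sound by changing an interpretable parameter:
# larger decay rates give a duller, shorter ring.
damped = synthesize_impact([440.0, 1200.0], [1.0, 0.5], [80.0, 160.0])
```

Under this assumption, the interpretability claim is easy to see: each parameter has a direct perceptual meaning, so changing the decay rates or frequencies changes the sound in a predictable way.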