Modeling the sounds emitted by physical object interactions is critical for immersive perceptual experiences in both real and virtual worlds. Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that can represent and synthesize the sound. However, they require fine details of both the object geometries and the impact locations, which are rarely available in the real world, so they cannot be applied to synthesize impact sounds from common videos. On the other hand, existing video-driven deep learning approaches can only capture a weak correspondence between visual content and impact sounds because they lack physics knowledge. In this work, we propose a physics-driven diffusion model that can synthesize high-fidelity impact sounds for a silent video clip. In addition to the video content, we propose to use physics priors to guide the impact sound synthesis procedure. The physics priors include both physics parameters that are directly estimated from noisy real-world impact sound examples without a sophisticated setup, and learned residual parameters that interpret the sound environment via neural networks. We further implement a novel diffusion model with specific training and inference strategies to combine physics priors and visual information for impact sound synthesis. Experimental results show that our model outperforms several existing systems in generating realistic impact sounds. More importantly, the physics-based representations are fully interpretable and transparent, enabling flexible sound editing.
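The physics parameters mentioned above follow the classic modal-synthesis view of impact sounds: each mode is a decaying sinusoid with its own frequency, damping, and amplitude. As a minimal illustrative sketch (the parameter values below are hypothetical, not from the paper), an impact sound can be resynthesized from such parameters like this:

```python
import numpy as np

def synthesize_impact(freqs, dampings, amps, sr=16000, duration=0.5):
    """Modal synthesis: sum of exponentially decaying sinusoids.

    freqs    -- modal frequencies in Hz
    dampings -- exponential decay rates (1/s)
    amps     -- initial amplitudes of each mode
    """
    t = np.arange(int(sr * duration)) / sr
    sound = np.zeros_like(t)
    for f, d, a in zip(freqs, dampings, amps):
        sound += a * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
    # Normalize to unit peak so the result fits in [-1, 1].
    peak = np.max(np.abs(sound))
    return sound / peak if peak > 0 else sound

# Hypothetical modal parameters for a struck object.
wave = synthesize_impact(freqs=[440.0, 1230.0, 2750.0],
                         dampings=[8.0, 15.0, 30.0],
                         amps=[1.0, 0.6, 0.3])
```

Estimating these per-mode parameters from clean simulations is straightforward; the difficulty the abstract points to is estimating them from noisy real-world recordings and combining them with learned residual parameters.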