Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into latent diffusion-based V2A generation through a Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by a Physical Parameter Estimator (PPE), which uses a Vision-Language Model (VLM) to infer the mass of the moving object and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect the underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focused on object-object interactions, and introduce the Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures the consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations. Visit https://physics-aware-video-to-audio-synthesis.github.io for demo videos.
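The abstract summarizes, but does not spell out, how the PPE turns a reconstructed trajectory into a velocity estimate or how APCC relates physical and auditory attributes. The sketch below is a minimal, hypothetical instantiation of both steps, assuming metric-scale centroid positions sampled at the video frame rate and a Pearson-style correlation; the function names (`estimate_velocity`, `apcc`) and the finite-difference scheme are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch (not the paper's code): velocity from a reconstructed
# 3D trajectory, and a Pearson-style audio-physics correlation.
import numpy as np

def estimate_velocity(trajectory: np.ndarray, fps: float) -> np.ndarray:
    """Per-frame speed from a (T, 3) object-centroid trajectory.

    Assumes the dynamic 3D reconstruction yields metric-scale positions
    sampled at the video frame rate; central finite differences are one
    simple choice of temporal derivative.
    """
    dt = 1.0 / fps
    vel = np.gradient(trajectory, dt, axis=0)   # (T, 3) velocity vectors
    return np.linalg.norm(vel, axis=1)          # (T,) speeds in m/s

def apcc(physical: np.ndarray, auditory: np.ndarray) -> float:
    """One plausible reading of APCC: Pearson correlation between a physical
    attribute (e.g., impact kinetic energy 0.5 * m * v**2) and an auditory
    attribute (e.g., peak loudness) over a set of evaluation clips."""
    return float(np.corrcoef(physical, auditory)[0, 1])
```

In this reading, the physical attribute for an impact clip could combine the VLM mass estimate with the speed at the moment of contact (e.g., kinetic energy), while the auditory attribute is extracted from the generated waveform.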