We propose a method for adding sound-guided visual effects to specific regions of videos in a zero-shot setting. Animating the appearance of a visual effect is challenging because each frame of the edited video must change visually while maintaining temporal consistency. Moreover, existing video editing solutions focus on temporal consistency across frames while ignoring visual style variations over time, e.g., a thunderstorm, waves, or fire crackling. To overcome this limitation, we utilize temporal sound features to drive the dynamic style. Specifically, we guide denoising diffusion probabilistic models with an audio latent representation in a shared audio-visual latent space. To the best of our knowledge, our work is the first to explore sound-guided natural video editing from various sound sources with sound-specific properties such as intensity, timbre, and volume. Additionally, we design optical flow-based guidance to generate temporally consistent video frames, capturing the pixel-wise relationships between adjacent frames. Experimental results show that our method outperforms existing video editing techniques, producing more realistic visual effects that reflect the properties of sound. Please visit our page: https://kuai-lab.github.io/soundini-gallery/.
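To make the two guidance signals concrete, below is a minimal PyTorch sketch of a single guided reverse-diffusion step, combining (a) a similarity term between the predicted clean frame and an audio embedding in a shared audio-visual latent space and (b) an optical-flow consistency term against the warped previous frame. This is an illustration under our own assumptions, not the paper's actual implementation: `eps_model`, `visual_enc`, `audio_emb`, and the dense `flow` field are hypothetical placeholders.

```python
# Hedged sketch of audio + optical-flow guided DDPM sampling (one step).
# eps_model, visual_enc, audio_emb, flow are illustrative placeholders.
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (B,C,H,W) with a dense pixel-space flow (B,2,H,W)."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=frame.device),
        torch.linspace(-1, 1, w, device=frame.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # Convert pixel displacements to grid_sample's [-1, 1] coordinates.
    offset = torch.stack((flow[:, 0] / (w / 2), flow[:, 1] / (h / 2)), dim=-1)
    return F.grid_sample(frame, grid + offset, align_corners=True)

def guided_step(x_t, t, eps_model, visual_enc, audio_emb,
                prev_frame, flow, alpha_bar, s_audio=1.0, s_flow=1.0):
    """One DDPM reverse step with gradient guidance on the predicted x0."""
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)                 # noise prediction network
    a = alpha_bar[t]
    x0_hat = (x_t - (1 - a).sqrt() * eps) / a.sqrt()  # predicted clean frame

    # (a) pull the frame toward the audio latent in the shared space
    sim = F.cosine_similarity(visual_enc(x0_hat), audio_emb, dim=-1).mean()
    # (b) keep the frame pixel-wise consistent with the warped previous frame
    flow_loss = F.l1_loss(x0_hat, warp(prev_frame, flow))

    grad = torch.autograd.grad(s_audio * sim - s_flow * flow_loss, x_t)[0]
    # In a full sampler this gradient would rescale the posterior mean
    # (classifier-guidance style); here we simply nudge x_t for brevity.
    return (x_t + grad).detach()
```

In a complete sampler, the nudged sample returned here would then pass through the usual DDPM posterior update at each timestep, so the audio term animates the effect while the flow term ties adjacent frames together.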