Although audio generation has been widely studied in recent years, video-aligned audio generation remains a relatively unexplored frontier. To address this gap, we introduce StereoSync, a novel and efficient model that generates audio temporally synchronized with a reference video and spatially aligned with its visual context. StereoSync achieves efficiency by leveraging pretrained foundation models, reducing the need for extensive training while maintaining high-quality synthesis. Unlike existing methods that focus primarily on temporal synchronization, StereoSync advances video-aligned audio generation by incorporating spatial awareness. Given an input video, our approach extracts spatial cues from depth maps and bounding boxes and uses them as cross-attention conditioning in a diffusion-based audio generation model. This allows StereoSync to go beyond simple synchronization, producing stereo audio that dynamically adapts to the spatial structure and movement of the video scene. We evaluate StereoSync on Walking The Maps, a curated dataset of video-game clips featuring animated characters walking through diverse environments. Experimental results demonstrate that StereoSync achieves both temporal and spatial alignment, advancing the state of the art in video-to-audio generation and producing a significantly more immersive and realistic audio experience.
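To make the conditioning scheme concrete, the sketch below illustrates how spatial cues could feed a diffusion denoiser via cross-attention. It is a minimal PyTorch illustration under stated assumptions, not the paper's implementation: all module names, feature dimensions, and the choice to pool depth features per frame and encode bounding boxes as `(x, y, w, h, score)` vectors are hypothetical.

```python
# Minimal sketch: depth-map and bounding-box features projected into a
# shared token space and consumed via cross-attention by the denoising
# network of a stereo audio diffusion model. All names and dimensions
# are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class SpatialCueEncoder(nn.Module):
    """Projects depth-map and bounding-box features into conditioning tokens."""

    def __init__(self, depth_dim=512, bbox_dim=5, d_model=768):
        super().__init__()
        self.depth_proj = nn.Linear(depth_dim, d_model)  # per-frame depth features
        self.bbox_proj = nn.Linear(bbox_dim, d_model)    # (x, y, w, h, score) per box

    def forward(self, depth_feats, bbox_feats):
        # depth_feats: (B, T, depth_dim); bbox_feats: (B, N, bbox_dim)
        tokens = torch.cat(
            [self.depth_proj(depth_feats), self.bbox_proj(bbox_feats)], dim=1
        )
        return tokens  # (B, T + N, d_model) cross-attention context


class CrossAttentionBlock(nn.Module):
    """One denoiser block attending from noisy audio latents to spatial-cue tokens."""

    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio_latents, cue_tokens):
        # audio_latents: (B, L, d_model) noisy stereo-audio latents at one
        # diffusion step; cue_tokens serve as keys and values.
        attended, _ = self.attn(audio_latents, cue_tokens, cue_tokens)
        return self.norm(audio_latents + attended)


if __name__ == "__main__":
    B, T, N, L = 2, 16, 4, 128
    encoder = SpatialCueEncoder()
    block = CrossAttentionBlock()
    depth = torch.randn(B, T, 512)    # e.g. pooled per-frame depth features
    boxes = torch.randn(B, N, 5)      # e.g. normalized bounding boxes
    latents = torch.randn(B, L, 768)  # noisy audio latents at one step
    out = block(latents, encoder(depth, boxes))
    print(out.shape)  # torch.Size([2, 128, 768])
```

In such a design, the denoiser queries the cue tokens at every step, so depth and object-position information can steer the left/right balance and timing of the generated stereo audio; the actual architecture, encoders, and token layout used by StereoSync may differ.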