Current video-to-audio (V2A) methods struggle in complex multi-event scenarios (video scenarios involving multiple sound sources, sound events, or transitions) due to two critical limitations. First, existing methods have difficulty precisely aligning intricate semantic information with rapid dynamic features. Second, foundational training lacks quantitative preference optimization for semantic-temporal alignment and audio quality, and therefore cannot improve the overall generation quality in cluttered multi-event scenes. To address these core limitations, this study proposes a novel V2A framework, MultiSoundGen, which introduces direct preference optimization (DPO) into the V2A domain and leverages audio-visual pretraining (AVP) to enhance performance in complex multi-event scenarios. Our contributions include two key innovations. First, we propose SlowFast Contrastive AVP (SF-CAVP), a pioneering AVP model with a unified dual-stream architecture that explicitly aligns the core semantic representations and rapid dynamic features of audio-visual data to handle multi-event complexity. Second, we integrate DPO into the V2A task and propose AVP-Ranked Preference Optimization (AVP-RPO), which uses SF-CAVP as a reward model to quantify and prioritize critical semantic-temporal matches while enhancing audio quality. Experiments demonstrate that MultiSoundGen achieves state-of-the-art (SOTA) performance in multi-event scenarios, delivering comprehensive gains in distribution matching, audio quality, semantic alignment, and temporal synchronization. Demos are available at https://v2aresearch.github.io/MultiSoundGen/.
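As background for readers unfamiliar with DPO, the standard objective (Rafailov et al., 2023) on which AVP-RPO builds is sketched below. The notation is illustrative, not the paper's: $x$ is assumed to be the conditioning video, and the preferred/dispreferred audio pair $(a_w, a_l)$ is assumed to be ranked by the SF-CAVP reward; the exact AVP-RPO adaptation may differ.

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,a_w,\,a_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(a_w\mid x)}{\pi_{\mathrm{ref}}(a_w\mid x)} - \beta\log\frac{\pi_\theta(a_l\mid x)}{\pi_{\mathrm{ref}}(a_l\mid x)}\right)\right]$$

Here $\pi_\theta$ is the generator being fine-tuned, $\pi_{\mathrm{ref}}$ is a frozen reference copy, $\sigma$ is the logistic function, and $\beta$ controls how far the fine-tuned model may drift from the reference while fitting the preference rankings.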