Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. We connect V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting compact video cross-attention after the model's existing text cross-attention, so prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (video; audio given text), and the bridge learns the audio-video dependency needed for synchronization without retraining the audio prior. To cut memory and stabilize training, we pool video tokens before conditioning. On curated video-audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi-modal systems, while preserving prompt-driven controllability and production-friendly modularity (encoders or the T2A backbone can be swapped or upgraded without end-to-end retraining). Although we focus on Video-to-Foley, the same bridge design may extend to other audio modalities (e.g., speech).
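The bridge described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: it uses a single attention head, plain Python lists instead of tensors, and mean pooling over consecutive video tokens. The names `mean_pool`, `cross_attention`, `init_bridge`, and `video_bridge` are hypothetical, and the zero-initialized residual gate is an assumption borrowed from common adapter practice, not something the abstract states. Here `audio_hidden` stands for the hidden states of one DiT block after its frozen text cross-attention.

```python
import math
import random


def mean_pool(tokens, factor):
    """Average consecutive groups of `factor` tokens to shorten the sequence."""
    pooled = []
    for start in range(0, len(tokens), factor):
        group = tokens[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


def matvec(weight, x):
    """Multiply an (out x in) matrix, stored as a list of rows, by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention from queries to context."""
    keys = [matvec(w_k, c) for c in context]
    values = [matvec(w_v, c) for c in context]
    out = []
    for q_in in queries:
        q = matvec(w_q, q_in)
        scale = math.sqrt(len(q))
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def init_bridge(audio_dim, video_dim, attn_dim, rng):
    """Create the only trainable parameters in this sketch (hypothetical layout)."""
    def rand(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "w_q": rand(attn_dim, audio_dim),   # audio hidden state -> query
        "w_k": rand(attn_dim, video_dim),   # pooled video token -> key
        "w_v": rand(audio_dim, video_dim),  # pooled video token -> value
        "gate": 0.0,                        # assumption: bridge starts as a no-op
    }


def video_bridge(audio_hidden, video_tokens, params, pool_factor=2):
    """Add gated video cross-attention on top of the frozen block's output."""
    pooled = mean_pool(video_tokens, pool_factor)
    attended = cross_attention(audio_hidden, pooled,
                               params["w_q"], params["w_k"], params["w_v"])
    gate = params["gate"]
    return [[h + gate * a for h, a in zip(h_row, a_row)]
            for h_row, a_row in zip(audio_hidden, attended)]
```

With the gate at zero the bridge returns the frozen model's hidden states unchanged, so training would start from the original text-to-audio behavior; only the bridge parameters would receive gradients in a real framework.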