The sound effects that designers add to videos are intended to convey a particular artistic effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges of creating a soundtrack for a video that differs from its true sound, but that nonetheless matches the actions occurring on screen, we propose the problem of conditional Foley. We present the following contributions to address this problem. First, we propose a pretext task for training our model to predict sound for an input video clip using a conditional audio-visual clip sampled from another time within the same source video. Second, we propose a model for generating a soundtrack for a silent input video, given a user-supplied example that specifies what the video should "sound like". We show through human studies and automated evaluation metrics that our model successfully generates sound from video, while varying its output according to the content of a supplied example. Project site: https://xypb.github.io/CondFoleyGen/
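To make the pretext task concrete, here is a minimal sketch of how a conditional/target clip pair could be drawn from a single source video. The function name `sample_pretext_pair`, the 2-second clip length, and the placement scheme are illustrative assumptions, not the paper's exact sampling procedure.

```python
import random

def sample_pretext_pair(video_duration, clip_len=2.0, rng=random):
    """Sample start times (seconds) for two non-overlapping clips from
    one video: a conditional audio-visual clip and a target clip.
    The model sees the target's frames plus the conditional clip's
    frames and audio, and must predict the target's audio."""
    assert video_duration >= 2 * clip_len, "video too short for two clips"
    # Place one clip, then place the other in the room that remains
    # after it; this guarantees the windows never overlap (though it is
    # not exactly uniform over all non-overlapping pairs).
    first = rng.uniform(0.0, video_duration - 2 * clip_len)
    second = rng.uniform(first + clip_len, video_duration - clip_len)
    # Randomize which clip plays the conditional role, so the condition
    # may come from before or after the target within the same video.
    if rng.random() < 0.5:
        return first, second  # (t_conditional, t_target)
    return second, first

# Example: one training pair from a 10-second source video.
t_cond, t_tgt = sample_pretext_pair(10.0)
# conditional example: frames and audio in [t_cond, t_cond + 2.0)
# model input:         frames only     in [t_tgt,  t_tgt  + 2.0)
# training target:     audio           in [t_tgt,  t_tgt  + 2.0)
```

Because both clips come from the same video, the conditional clip's audio is a plausible stand-in for what the target clip "should sound like", which is what lets the task be trained without paired Foley annotations.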