We consider the problem of generating musical soundtracks in sync with rhythmic visual cues. Most existing works rely on pre-defined music representations, which limits generative flexibility and complexity. Other methods that directly generate video-conditioned waveforms are restricted to limited scenarios and short lengths, and suffer from unstable generation quality. To this end, we present Long-Term Rhythmic Video Soundtracker (LORIS), a novel framework to synthesize long-term conditional waveforms. Specifically, our framework consists of a latent conditional diffusion probabilistic model to perform waveform synthesis. Furthermore, we propose a series of context-aware conditioning encoders that take temporal information into consideration for long-term generation. Notably, we extend our model's applicability from dances to multiple sports scenarios such as floor exercise and figure skating. To perform comprehensive evaluations, we establish a benchmark for rhythmic video soundtracks, including a pre-processed dataset, improved evaluation metrics, and robust generative baselines. Extensive experiments show that our model generates long-term soundtracks with state-of-the-art musical quality and rhythmic correspondence. Code is available at \url{https://github.com/OpenGVLab/LORIS}.