We present an approach for recommending a music track for a given video, and vice versa, based on both their temporal alignment and their correspondence at an artistic level. We propose a self-supervised approach that learns this correspondence directly from data, without the need for human annotations. To capture the high-level concepts required to solve the task, we model the long-term temporal context of both the video and the music signals, using a Transformer network for each modality. Experiments show that this approach strongly outperforms alternatives that do not exploit temporal context. The combination of our contributions improves retrieval accuracy by up to 10x over the prior state of the art. This strong improvement enables a wide range of analyses and applications; for instance, we can condition music retrieval on visually defined attributes.
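To make the described setup concrete, here is a minimal sketch, assuming a standard PyTorch formulation; this is our illustration, not the authors' released code. One Transformer encoder per modality pools a sequence of pre-extracted per-segment features into a single embedding, and a symmetric InfoNCE-style contrastive loss aligns matching video-music pairs so that either modality can retrieve the other. All module names, dimensions, and the specific loss are assumptions, and positional encodings are omitted for brevity.

```python
# Hypothetical sketch of a dual-Transformer cross-modal retrieval model
# (our assumption of the setup described in the abstract, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Transformer over a temporal sequence of per-segment features,
    pooled into a single unit-norm embedding via a learned CLS token."""
    def __init__(self, in_dim, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        self.cls = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):                       # x: (batch, time, in_dim)
        h = self.proj(x)
        cls = self.cls.expand(h.size(0), -1, -1)
        h = self.encoder(torch.cat([cls, h], dim=1))
        return F.normalize(h[:, 0], dim=-1)     # pooled embedding

def contrastive_loss(video_emb, music_emb, temperature=0.07):
    """Symmetric InfoNCE: matching (video, music) pairs in the batch are
    positives; all other pairings serve as in-batch negatives."""
    logits = video_emb @ music_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Usage with dummy features (dimensions are placeholders):
video_enc, music_enc = ModalityEncoder(2048), ModalityEncoder(128)
video_feats = torch.randn(8, 30, 2048)   # e.g. 30 frame features per clip
music_feats = torch.randn(8, 30, 128)    # e.g. 30 audio-segment features
loss = contrastive_loss(video_enc(video_feats), music_enc(music_feats))
loss.backward()
```

At retrieval time, both encoders would be run offline over a catalog, and recommendation in either direction reduces to a nearest-neighbor search in the shared embedding space.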