With the rise of short videos, the demand for selecting appropriate background music (BGM) for a video has increased significantly, and the video-music retrieval (VMR) task has gradually drawn much attention from the research community. As in other cross-modal learning tasks, existing VMR approaches usually attempt to measure the similarity between video and music in a shared feature space. However, they (1) neglect the inevitable label noise and (2) fail to enhance the model's ability to capture critical video clips. In this paper, we propose a novel saliency-based self-training framework, termed SSVMR. Specifically, we first fully exploit the information contained in the training dataset by applying a semi-supervised method to suppress the adverse impact of label noise, where a self-training approach is adopted. In addition, we propose to capture the saliency of a video by mixing two videos at the span level while preserving the locality of the two original videos. Inspired by back translation in NLP, we also conduct back retrieval to obtain more training data. Experimental results on the MVD dataset show that our SSVMR achieves state-of-the-art performance by a large margin, with a relative improvement of 34.8% over the previous best model in terms of R@1.
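To make the span-level mixing idea concrete, the following is a minimal sketch under our own assumptions, not the authors' implementation: the function name `span_mix`, the mixing ratio `lam`, and the clip-level feature shapes are all illustrative. A contiguous span of one video's clip features replaces the corresponding span of the other, so the locality of both originals is preserved rather than shuffled frame by frame.

```python
import torch

def span_mix(video_a: torch.Tensor, video_b: torch.Tensor, lam: float) -> torch.Tensor:
    """Mix two clip-level feature sequences at span level (hypothetical sketch).

    A contiguous span of clips from video_b overwrites the corresponding
    span in video_a, keeping each original's local temporal structure intact.
    Both inputs have shape (num_clips, feat_dim); `lam` is the fraction of
    clips kept from video_a.
    """
    num_clips = video_a.size(0)
    span_len = int(round((1.0 - lam) * num_clips))
    if span_len == 0:
        return video_a.clone()
    # Choose a random start so the replaced span is contiguous.
    start = torch.randint(0, num_clips - span_len + 1, (1,)).item()
    mixed = video_a.clone()
    mixed[start:start + span_len] = video_b[start:start + span_len]
    return mixed

# Usage: mix two 32-clip videos, keeping roughly 75% of video_a.
a, b = torch.randn(32, 512), torch.randn(32, 512)
mixed = span_mix(a, b, lam=0.75)
```

In this sketch the mixed sequence could be trained against a proportionally mixed retrieval target (weighted by `lam`), in the spirit of CutMix-style objectives; the paper's exact loss formulation is not reproduced here.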