Modern video summarization methods are based on deep neural networks that require a large amount of annotated data for training. However, existing datasets for video summarization are small-scale, which easily leads to over-fitting of the deep models. Considering that the annotation of large-scale datasets is time-consuming, we propose a multimodal self-supervised learning framework to obtain semantic representations of videos, which benefits the video summarization task. Specifically, self-supervised learning is conducted by exploiting the semantic consistency between videos and text in both coarse-grained and fine-grained fashions, as well as by recovering masked frames in the videos. The multimodal framework is trained on a newly collected dataset that consists of video-text pairs. Additionally, we introduce a progressive video summarization method, in which the important content of a video is pinpointed progressively to generate better summaries. Extensive experiments demonstrate the effectiveness and superiority of our method in terms of rank correlation coefficients and F-score.
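To make the described training objective concrete, the following is a minimal sketch, assuming a PyTorch setup, of how the three self-supervised terms mentioned above (coarse-grained video-text consistency, fine-grained frame-word consistency, and masked frame recovery) could be combined; the function names, tensor shapes, and loss weights are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): a combined self-supervised loss with
# (i) a coarse-grained video-text contrastive term, (ii) a fine-grained frame-word
# alignment term, and (iii) a masked-frame reconstruction term.
import torch
import torch.nn.functional as F

def coarse_contrastive(video_emb, text_emb, temperature=0.07):
    """InfoNCE over pooled video/text embeddings of shape (B, D)."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature          # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def fine_grained_alignment(frame_emb, word_emb):
    """Encourage each frame to match some word and vice versa.
    frame_emb: (B, T, D), word_emb: (B, L, D)."""
    frame_emb = F.normalize(frame_emb, dim=-1)
    word_emb = F.normalize(word_emb, dim=-1)
    sim = torch.einsum('btd,bld->btl', frame_emb, word_emb)  # (B, T, L)
    return -(sim.max(dim=2).values.mean() + sim.max(dim=1).values.mean()) / 2

def masked_frame_recovery(pred_frames, target_frames, mask):
    """MSE on masked frame features only; mask: (B, T), 1 where masked."""
    diff = (pred_frames - target_frames).pow(2).mean(dim=-1)  # (B, T)
    return (diff * mask).sum() / mask.sum().clamp(min=1)

def self_supervised_loss(video_emb, text_emb, frame_emb, word_emb,
                         pred_frames, target_frames, mask,
                         w_coarse=1.0, w_fine=1.0, w_mask=1.0):
    # Weighted sum of the three pretext objectives (weights are assumptions).
    return (w_coarse * coarse_contrastive(video_emb, text_emb) +
            w_fine * fine_grained_alignment(frame_emb, word_emb) +
            w_mask * masked_frame_recovery(pred_frames, target_frames, mask))
```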