Modern video summarization methods rely on deep neural networks, which require large amounts of annotated data for training. However, existing datasets for video summarization are small-scale, which easily leads to overfitting of deep models. Since annotating large-scale datasets is time-consuming, we propose a multimodal self-supervised learning framework to obtain semantic representations of videos that benefit the video summarization task. Specifically, we exploit the semantic consistency between the visual and textual information of videos to pretrain a multimodal encoder in a self-supervised manner on a newly collected dataset of video-text pairs. In addition, we introduce a progressive video summarization method, in which the important content of a video is pinpointed progressively to generate better summaries. Finally, we propose an objective evaluation framework that measures the quality of video summaries based on video classification. Extensive experiments demonstrate that our method outperforms the state of the art in rank correlation coefficients, F-score, and the proposed objective evaluation.
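The abstract does not specify the pretraining objective used to enforce visual-text semantic consistency. The following is a minimal sketch, assuming a symmetric InfoNCE-style contrastive loss, which is one common way to align video and text embeddings from a multimodal encoder; the function name, batch size, embedding dimension, and temperature are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of video-text pairs.

    Matched pairs (same row index) are pulled together; all mismatched
    pairs in the batch serve as negatives and are pushed apart.
    Shapes: video_emb, text_emb are (B, D).
    """
    # Project embeddings onto the unit sphere so logits are cosine similarities.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the positive pairs.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(video_emb.size(0), device=video_emb.device)

    # Contrast in both directions: video-to-text and text-to-video.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2

# Toy usage: random tensors stand in for encoder outputs (hypothetical sizes).
video_emb = torch.randn(8, 256)
text_emb = torch.randn(8, 256)
print(video_text_contrastive_loss(video_emb, text_emb))
```

In such a setup, minimizing this loss drives the encoder toward representations in which a video and its paired text are close, which is the kind of semantic alignment the abstract describes as the basis for self-supervised pretraining.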