Video summarization aims to generate a compact yet representative visual summary that conveys the essence of the original video. The advantage of unsupervised approaches is that they require no human annotations to learn the summarization capability and therefore generalize to a wider range of domains. Previous work relies on a single type of deep feature, typically extracted from a model pre-trained on ImageNet. We therefore propose incorporating multiple feature sources, combined via chunk and stride fusion, to provide richer information about the visual content. For a comprehensive evaluation on the two benchmarks TVSum and SumMe, we compare our method with four state-of-the-art approaches, two of which we re-implemented in order to reproduce the reported results. Our evaluation shows that we obtain state-of-the-art results on both datasets, while also highlighting shortcomings of previous work with regard to the evaluation methodology. Finally, we perform an error analysis on videos from the two benchmark datasets to identify the factors that lead to misclassifications.
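The abstract mentions combining multiple feature sources with chunk and stride fusion. The exact fusion scheme is not specified here, so the following is only a hypothetical sketch: it assumes "chunk" fusion concatenates per-frame features from all sources along the channel axis, while "stride" fusion alternates between sources along the temporal axis. All names and shapes are illustrative, not the paper's actual implementation.

```python
import numpy as np

def chunk_fusion(feature_sources):
    # Hypothetical chunk fusion: concatenate per-frame features from
    # all sources along the channel axis, so each frame carries one
    # contiguous "chunk" per feature source.
    # feature_sources: list of arrays, each of shape (n_frames, dim_i)
    return np.concatenate(feature_sources, axis=1)

def stride_fusion(feature_sources):
    # Hypothetical stride fusion: frame t takes its features from
    # source (t mod n_sources), interleaving the sources temporally.
    # Assumes all sources share the same feature dimension.
    n_sources = len(feature_sources)
    fused = np.empty_like(feature_sources[0])
    for t in range(fused.shape[0]):
        fused[t] = feature_sources[t % n_sources][t]
    return fused

# Example: two feature sources for a 6-frame video
imagenet_feats = np.random.rand(6, 1024)  # e.g. ImageNet CNN features
motion_feats = np.random.rand(6, 1024)    # hypothetical second source
print(chunk_fusion([imagenet_feats, motion_feats]).shape)   # (6, 2048)
print(stride_fusion([imagenet_feats, motion_feats]).shape)  # (6, 1024)
```

Chunk fusion preserves every source at every frame at the cost of a larger feature dimension; stride fusion keeps the dimension fixed but samples each source only on a subset of frames.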