Video summarization aims at selecting the parts of a video that narrate a story as closely as possible to the original one. Most existing video summarization approaches rely on hand-crafted labels. As the number of videos grows exponentially, there is an increasing need for methods that can learn meaningful summaries without labeled annotations. In this paper, we aim to maximally exploit unsupervised video summarization while concentrating the supervision on a few personalized labels as an add-on. To do so, we formulate the key requirements for informative video summarization. We then propose contrastive learning as the answer to these requirements. To further boost Contrastive video Summarization (CSUM), we propose to contrast top-k features instead of a mean video feature as employed by existing methods, which we implement with a differentiable top-k feature selector. Our experiments on several benchmarks demonstrate that our approach yields meaningful and diverse summaries when no labeled data is provided.
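Since the abstract mentions contrasting top-k features via a differentiable top-k feature selector, a minimal sketch of one possible relaxation is shown below. The iterative-softmax selector, the `SoftTopKSelector` class, and all shapes and hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of a differentiable top-k feature selector, assuming per-frame
# features of shape (batch, frames, dim) and a learned scoring head. It uses an
# iterative-softmax relaxation (a soft "top-k without replacement"); this is an
# assumed stand-in, not necessarily the selector used in CSUM.
import torch
import torch.nn as nn


class SoftTopKSelector(nn.Module):
    def __init__(self, dim: int, k: int, temperature: float = 0.1):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # per-frame importance scores
        self.k = k
        self.temperature = temperature

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) per-frame features
        logits = self.score(feats).squeeze(-1)           # (B, T)
        mask = torch.zeros_like(logits)                  # accumulated selection mass
        selected = []
        for _ in range(self.k):
            # suppress already-selected frames, then take a soft argmax
            probs = torch.softmax((logits - 1e9 * mask) / self.temperature, dim=-1)
            selected.append(probs)
            mask = mask + probs
        # (B, k, T) soft selection weights -> (B, k, D) selected features
        weights = torch.stack(selected, dim=1)
        return weights @ feats


if __name__ == "__main__":
    video = torch.randn(2, 64, 512)                      # 2 videos, 64 frames, 512-d features
    topk_feats = SoftTopKSelector(dim=512, k=8)(video)   # (2, 8, 512)
    print(topk_feats.shape)
```

In such a setup, the k pooled features would replace a single mean-pooled video embedding as inputs to a contrastive loss (e.g., InfoNCE), and the whole selector remains trainable end-to-end because every step is a softmax rather than a hard argmax.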