Video summarization aims at choosing parts of a video that narrate a story as close as possible to the original one. Most of the existing video summarization approaches focus on hand-crafted labels. As the number of videos grows exponentially, there emerges an increasing need for methods that can learn meaningful summarizations without labeled annotations. In this paper, we aim to maximally exploit unsupervised video summarization while concentrating the supervision to a few, personalized labels as an add-on. To do so, we formulate the key requirements for the informative video summarization. Then, we propose contrastive learning as the answer to both questions. To further boost Contrastive video Summarization (CSUM), we propose to contrast top-k features instead of a mean video feature as employed by the existing method, which we implement with a differentiable top-k feature selector. Our experiments on several benchmarks demonstrate, that our approach allows for meaningful and diverse summaries when no labeled data is provided.
翻译:视频摘要旨在选择一个视频的部件, 尽可能把故事描述得接近原始的。 大多数现有的视频摘要方法都侧重于手工制作的标签。 随着视频数量成倍增长, 出现了一种越来越需要的方法来学习有意义的摘要, 而不贴标签的注释。 在本文中, 我们的目标是尽量利用未经监督的视频摘要, 同时把监督集中在少数个性化标签上, 将其作为附加内容。 为了做到这一点, 我们为信息化视频摘要制定了关键要求。 然后, 我们提出对比学习作为两个问题的答案。 为了进一步提升对比性视频摘要( CSUM), 我们建议对比现有方法所使用的顶部特征而不是平均视频特征, 我们用不同的顶部特征选择器执行。 我们在几个基准上的实验显示, 我们的方法允许在没有提供标签数据时进行有意义和多样的概要 。