Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing. Unsupervised methods usually rely on heuristic training objectives such as diversity and representativeness. However, such methods need to bootstrap the online-generated summaries to compute the objectives for importance score regression. We consider such a pipeline inefficient and instead seek to directly quantify frame-level importance using contrastive losses from the representation learning literature. Leveraging these losses, we propose three metrics that characterize a desirable key frame: local dissimilarity, global consistency, and uniqueness. With features pre-trained on an image classification task, the metrics can already yield high-quality importance scores, achieving performance competitive with or better than past heavily trained methods. We show that refining the pre-trained features with a lightweight, contrastively learned projection module further improves the frame-level importance scores, and that the model can also leverage a large number of random videos and generalize to test videos with decent performance. Code is available at https://github.com/pangzss/pytorch-CTVSUM.
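The abstract does not define the three metrics precisely, but their intent can be illustrated with a minimal sketch over pre-trained frame features. The definitions below are our own illustrative assumptions (local dissimilarity as mean cosine distance to temporal neighbors, global consistency as cosine similarity to the video's mean feature, uniqueness as one minus the maximum similarity to any other frame), not necessarily the paper's exact formulations:

```python
import numpy as np

def frame_importance(feats, k=2):
    """Sketch of frame-level importance from pre-trained features.

    feats: (T, D) array of frame features, T > 1.
    k: temporal neighborhood radius for local dissimilarity.
    All three terms below are illustrative assumptions.
    """
    T = feats.shape[0]
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # unit-normalize rows
    sim = f @ f.T  # pairwise cosine similarities, shape (T, T)

    # Local dissimilarity: mean cosine distance to the k temporal neighbors on each side.
    local = np.zeros(T)
    for t in range(T):
        nbrs = [j for j in range(max(0, t - k), min(T, t + k + 1)) if j != t]
        local[t] = np.mean(1.0 - sim[t, nbrs])

    # Global consistency: cosine similarity to the video's (normalized) mean feature.
    g = f.mean(axis=0)
    g /= np.linalg.norm(g)
    glob = f @ g

    # Uniqueness: one minus the maximum similarity to any *other* frame.
    np.fill_diagonal(sim, -np.inf)
    uniq = 1.0 - sim.max(axis=1)

    # Unweighted sum; how the paper combines the metrics is another assumption.
    return local + glob + uniq
```

In this sketch the scores are computed directly from fixed features with no training, mirroring the abstract's claim that pre-trained features alone already yield usable importance scores; the contrastively learned projection module would simply transform `feats` before this step.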