Despite its wide range of applications, video summarization is still held back by the scarcity of large datasets, largely due to the labor-intensive and costly nature of frame-level annotations. As a result, existing video summarization methods are prone to overfitting. To mitigate this challenge, we propose a novel self-supervised video representation learning method that uses knowledge distillation to pre-train a transformer encoder. Our method matches the encoder's semantic video representation, constructed with respect to frame importance scores, to a representation derived from a CNN trained on video classification. Empirical evaluations on correlation-based metrics, such as Kendall's $\tau$ and Spearman's $\rho$, demonstrate the superiority of our approach over existing state-of-the-art methods in assigning relative scores to the input frames.
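To make the distillation objective concrete, below is a minimal PyTorch sketch under assumed details not stated above: the module names (`StudentEncoder`, `distillation_loss`), the mean-pooling of the frozen CNN's frame features into a video-level target, the softmax importance scorer, and the MSE matching loss are all illustrative choices, not the paper's exact formulation.

```python
# Minimal sketch of the self-supervised pre-training objective: a transformer
# student produces per-frame embeddings and importance scores, whose weighted
# average (the semantic video representation) is matched to a video-level
# representation derived from a frozen CNN. Shapes and hyperparameters are
# assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StudentEncoder(nn.Module):
    """Transformer encoder outputting frame embeddings and importance scores."""

    def __init__(self, feat_dim: int = 1024, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.scorer = nn.Linear(feat_dim, 1)  # per-frame importance logit

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (batch, num_frames, feat_dim) -- e.g. frozen CNN frame features
        hidden = self.encoder(frame_feats)
        scores = torch.softmax(self.scorer(hidden).squeeze(-1), dim=-1)  # (B, T)
        # Semantic video representation: importance-weighted sum of frame embeddings.
        video_repr = torch.einsum("bt,btd->bd", scores, hidden)
        return video_repr, scores


def distillation_loss(student_repr: torch.Tensor, teacher_repr: torch.Tensor):
    """Match the student's video representation to the CNN teacher's target
    (MSE shown here as one plausible matching loss)."""
    return F.mse_loss(student_repr, teacher_repr)


if __name__ == "__main__":
    B, T, D = 4, 120, 1024
    cnn_frame_feats = torch.randn(B, T, D)      # stand-in for frozen CNN features
    teacher_repr = cnn_frame_feats.mean(dim=1)  # assumed video-level teacher target

    student = StudentEncoder(feat_dim=D)
    student_repr, importance = student(cnn_frame_feats)
    loss = distillation_loss(student_repr, teacher_repr)
    loss.backward()
    print(loss.item(), importance.shape)
```

After pre-training with such an objective, the importance scores produced by the scorer head can be read off directly as the relative frame scores evaluated with Kendall's $\tau$ and Spearman's $\rho$.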