Self-supervised learning has drawn attention through its effectiveness in learning in-domain representations with no ground-truth annotations; in particular, it has been shown that properly designed pretext tasks (e.g., a contrastive prediction task) bring significant performance gains for downstream tasks (e.g., a classification task). Inspired by this, we tackle video scene segmentation, the task of temporally localizing scene boundaries in a video, with a self-supervised learning framework in which we mainly focus on designing effective pretext tasks. In our framework, we discover a pseudo-boundary from a sequence of shots by splitting it into two contiguous, non-overlapping sub-sequences, and we leverage the pseudo-boundary to facilitate pre-training. Based on this, we introduce three novel boundary-aware pretext tasks: 1) Shot-Scene Matching (SSM), 2) Contextual Group Matching (CGM), and 3) Pseudo-boundary Prediction (PP); SSM and CGM guide the model to maximize intra-scene similarity and inter-scene discrimination, while PP encourages the model to identify transitional moments. Through comprehensive analysis, we empirically show that pre-training and transferring contextual representation are both critical to improving video scene segmentation performance. Lastly, we achieve a new state-of-the-art on the MovieNet-SSeg benchmark. The code is available at https://github.com/kakaobrain/bassl.
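The pseudo-boundary discovery described above can be illustrated with a minimal sketch: given a window of shot embeddings, pick the split point that divides the window into two contiguous, non-overlapping sub-sequences whose shots are each most similar to their own side's centroid. This is an illustrative toy implementation, not the authors' actual splitting procedure; the function name `find_pseudo_boundary` and the centroid-similarity criterion are assumptions made for the sketch.

```python
# Illustrative sketch of pseudo-boundary discovery (NOT the exact BaSSL
# algorithm): split a window of shot embeddings at the index k that
# maximizes the average cosine similarity of each shot to the centroid
# of its own sub-sequence.
import math

def cosine(a, b):
    # Cosine similarity between two vectors; 0.0 if either is zero.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vecs):
    # Element-wise mean of a list of equal-length vectors.
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def find_pseudo_boundary(shots):
    """Return k so that shots[:k] and shots[k:] are the two contiguous,
    non-overlapping sub-sequences; k marks the pseudo-boundary."""
    best_k, best_score = 1, -float("inf")
    for k in range(1, len(shots)):
        left, right = shots[:k], shots[k:]
        cl, cr = centroid(left), centroid(right)
        score = (sum(cosine(s, cl) for s in left) +
                 sum(cosine(s, cr) for s in right)) / len(shots)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Toy example: three shots near [1, 0] followed by two near [0, 1]
# should yield a pseudo-boundary between index 2 and 3.
shots = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 1.0]]
print(find_pseudo_boundary(shots))  # → 3
```

The discovered index then serves as a supervisory signal for the three pretext tasks, e.g., PP can be trained to predict whether a given position coincides with the pseudo-boundary.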