A long-term video, such as a movie or TV show, is composed of various scenes, each of which represents a series of shots sharing the same semantic story. Spotting correct scene boundaries in a long-term video is a challenging task, since a model must understand the storyline of the video to figure out where a scene starts and ends. To this end, we propose an effective Self-Supervised Learning (SSL) framework to learn better shot representations from unlabeled long-term videos. More specifically, we present an SSL scheme to achieve scene consistency, while exploring extensive data augmentation and shuffling strategies to improve the model's generalizability. Instead of explicitly learning scene boundary features as previous methods do, we introduce a vanilla temporal model with less inductive bias to verify the quality of the shot features. Our method achieves state-of-the-art performance on the task of Video Scene Segmentation. Additionally, we suggest a fairer and more reasonable benchmark for evaluating Video Scene Segmentation methods. The code is publicly available.
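To make the scheme concrete, below is a minimal sketch of what a scene-consistency objective combined with a shot-shuffling augmentation could look like in PyTorch. The function names (`local_shuffle`, `scene_consistency_loss`) and the InfoNCE-style formulation are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of scene-consistency SSL for shot representations.
# Assumptions: shots from the same scene are treated as positive pairs,
# other batch items as negatives; shot order is lightly shuffled as an
# augmentation. Not the paper's official code.
import torch
import torch.nn.functional as F

def local_shuffle(shots: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Shuffling augmentation: permute shot order by jittering temporal
    positions, so nearby shots may swap while the coarse story order holds.
    `shots` has shape (B, T, ...) with T shots per video clip."""
    T = shots.shape[1]
    jittered = torch.arange(T, dtype=torch.float32) + torch.randn(T) * sigma
    return shots[:, torch.argsort(jittered)]

def scene_consistency_loss(anchor: torch.Tensor,
                           positive: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss over a batch of (B, D) shot embeddings.
    Each positive is the embedding of another shot sampled from the same
    scene as its anchor; the remaining batch items serve as negatives."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature   # (B, B) similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)
```

In this reading, a shot encoder embeds two views of shots drawn from the same (pseudo-)scene, one of them augmented via `local_shuffle`, and training minimizes `scene_consistency_loss` so that shots within a scene cluster together in feature space.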