Scenes play a crucial role in breaking the storyline of movies and TV episodes into semantically cohesive parts. However, given their complex temporal structure, finding scene boundaries can be a challenging task requiring large amounts of labeled training data. To address this challenge, we present a self-supervised shot contrastive learning approach (ShotCoL) to learn a shot representation that maximizes the similarity between nearby shots compared to randomly selected shots. We show how to apply our learned shot representation to the task of scene boundary detection to offer state-of-the-art performance on the MovieNet dataset while requiring only ~25% of the training labels, using 9x fewer model parameters, and offering 7x faster runtime. To assess the effectiveness of ShotCoL on novel applications of scene boundary detection, we take on the problem of finding timestamps in movies and TV episodes where video ads can be inserted while offering a minimally disruptive viewing experience. To this end, we collected a new dataset called AdCuepoints with 3,975 movies and TV episodes, 2.2 million shots, and 19,119 minimally disruptive ad cue-point labels. We present a thorough empirical analysis on this dataset demonstrating the effectiveness of ShotCoL for ad cue-point detection.
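The contrastive objective described above can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch, not the exact ShotCoL objective or encoder (those are defined in the paper): an anchor shot's embedding is pulled toward a positive key (a nearby shot) and pushed away from negative keys (randomly selected shots). The function name `info_nce_loss` and all parameter values here are illustrative assumptions.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def info_nce_loss(query, positive, negatives, temperature=0.07):
    """Generic InfoNCE-style contrastive loss (illustrative sketch, not
    the exact ShotCoL formulation): the query embedding is pulled toward
    its positive (a nearby shot) and away from the negatives (randomly
    selected shots)."""
    q = normalize(query)
    keys = [normalize(positive)] + [normalize(n) for n in negatives]
    # Cosine similarities scaled by temperature; positive sits at index 0.
    logits = [dot(q, k) / temperature for k in keys]
    # Numerically stable cross-entropy with the positive as the target.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

# Toy usage: a "nearby shot" is modeled as the anchor plus small noise,
# while negatives are unrelated random vectors.
random.seed(0)
dim = 64
anchor = [random.gauss(0, 1) for _ in range(dim)]
nearby = [a + 0.1 * random.gauss(0, 1) for a in anchor]
negatives = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(8)]
loss = info_nce_loss(anchor, nearby, negatives)  # small: positive is close to the anchor
```

In this setup the loss is small when the positive key is genuinely similar to the anchor and grows when it is not, which is what drives the learned representation to cluster nearby shots together.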