Scene segmentation and classification (SSC) serve as a critical step in video structure analysis. Intuitively, learning these two tasks jointly should be mutually beneficial, since they share common information. However, scene segmentation focuses more on the local differences between adjacent shots, while classification requires a global representation of scene segments, which can cause the model to be dominated by one of the two tasks during training. In this paper, we take an alternative perspective to overcome this challenge: we unify the two tasks into one by predicting shot links, where a link connects two adjacent shots and indicates that they belong to the same scene or category. To this end, we propose a general One Stage Multimodal Sequential Link Framework (OS-MSL) that both distinguishes and leverages the two-fold semantics by reformulating the two learning tasks into a unified one. Furthermore, we tailor a specific module called DiffCorrNet to explicitly extract the differences and correlations among shots. Extensive experiments are conducted on a brand-new large-scale dataset collected from real-world applications, as well as on MovieScenes. The results on both demonstrate the effectiveness of our proposed method against strong baselines.
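To make the shot-link formulation concrete, the following is a minimal sketch (not the paper's actual implementation; function and variable names are illustrative) of how binary links between adjacent shots can be decoded back into scene segments, which is what unifies segmentation with per-segment classification:

```python
# Hypothetical illustration of the "shot link" idea: links[i] == 1 means
# shot i and shot i+1 belong to the same scene, so a run of 1-links forms
# one scene segment and each 0-link marks a scene boundary.

def links_to_scenes(links):
    """Group shots 0..len(links) into scenes from adjacent-shot links.

    Returns a list of scenes, each a list of consecutive shot indices.
    """
    scenes = [[0]]
    for i, linked in enumerate(links):
        if linked:
            scenes[-1].append(i + 1)   # same scene: extend current segment
        else:
            scenes.append([i + 1])     # boundary: start a new segment
    return scenes

# 6 shots with links 1,1,0,1,0 -> three scenes: {0,1,2}, {3,4}, {5}
print(links_to_scenes([1, 1, 0, 1, 0]))  # [[0, 1, 2], [3, 4], [5]]
```

Under this decoding, a per-link binary prediction simultaneously determines both the segmentation (where scenes end) and the grouping over which a category label would be assigned.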