The ability to distinguish between different movie scenes is critical for understanding the storyline of a movie. However, accurately detecting movie scenes is often challenging as it requires the ability to reason over very long movie segments. This is in contrast to most existing video recognition models, which are typically designed for short-range video analysis. This work proposes a State-Space Transformer model that can efficiently capture dependencies in long movie videos for accurate movie scene detection. Our model, dubbed TranS4mer, is built using a novel S4A building block, which combines the strengths of structured state-space sequence (S4) and self-attention (A) layers. Given a sequence of frames divided into movie shots (uninterrupted periods where the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot dependencies. Afterward, the state-space operation in the S4A block is used to aggregate long-range inter-shot cues. The final TranS4mer model, which can be trained end-to-end, is obtained by stacking the S4A blocks one after the other multiple times. Our proposed TranS4mer outperforms all prior methods on three movie scene detection datasets, including MovieNet, BBC, and OVSD, while also being $2\times$ faster and requiring $3\times$ less GPU memory than standard Transformer models. We will release our code and models.
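The sketch below is not the authors' implementation; it is a minimal illustration of the two-stage idea described above, assuming frame features arranged as (batch, shots, frames-per-shot, dim). The class names `S4ABlock` and `SimpleDiagonalSSM` are hypothetical, and the diagonal recurrent scan is only a stand-in for a full structured state-space (S4) layer: self-attention mixes frames within each shot, and the state-space recurrence then propagates information across shot boundaries.

```python
# Hedged sketch of an S4A-style block (not the released TranS4mer code).
import torch
import torch.nn as nn


class SimpleDiagonalSSM(nn.Module):
    """Simplified diagonal state-space layer: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t.

    A real S4 layer uses a structured (HiPPO-initialized) state matrix and an
    efficient convolutional evaluation; this sequential scan is illustrative only.
    """

    def __init__(self, dim, state_size=64):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(state_size))          # negative for stable decay
        self.B = nn.Linear(dim, state_size, bias=False)
        self.C = nn.Linear(state_size, dim, bias=False)

    def forward(self, x):                        # x: (batch, seq_len, dim)
        batch, seq_len, _ = x.shape
        h = x.new_zeros(batch, self.A.shape[0])
        decay = torch.exp(self.A)                # element-wise decay in (0, 1)
        outputs = []
        for t in range(seq_len):
            h = decay * h + self.B(x[:, t])      # recurrent state update
            outputs.append(self.C(h))
        return torch.stack(outputs, dim=1)


class S4ABlock(nn.Module):
    """Hypothetical S4A block: intra-shot self-attention + inter-shot state-space mixing."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ssm = SimpleDiagonalSSM(dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (batch, shots, frames, dim)
        b, s, f, d = x.shape
        # Short-range: self-attention over the frames of each shot independently.
        tokens = self.norm1(x).reshape(b * s, f, d)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        x = x + attn_out.reshape(b, s, f, d)
        # Long-range: state-space recurrence over the full frame sequence,
        # letting cues flow across shot boundaries.
        seq = self.norm2(x).reshape(b, s * f, d)
        x = x + self.ssm(seq).reshape(b, s, f, d)
        return x


if __name__ == "__main__":
    model = nn.Sequential(*[S4ABlock(dim=128) for _ in range(4)])  # stacked S4A blocks
    frames = torch.randn(2, 8, 5, 128)           # 2 videos, 8 shots, 5 frames per shot
    print(model(frames).shape)                   # torch.Size([2, 8, 5, 128])
```

Stacking the blocks end-to-end mirrors the description in the abstract; in practice the short intra-shot attention keeps the quadratic cost bounded, while the state-space pathway handles the long-range dependencies at linear cost.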