Generic Event Boundary Detection (GEBD) aims to detect the moments that humans naturally perceive as event boundaries. In this paper, we present the Structured Context Transformer (SC-Transformer) to solve the GEBD task, which can be trained in an end-to-end fashion. Specifically, we use a backbone convolutional neural network (CNN) to extract the features of each video frame. To capture the temporal context of each frame, we design the structured context transformer by re-partitioning the input frame sequence. Note that the overall computational complexity of the SC-Transformer is linear in the video length. After that, group similarities are computed to capture the differences between frames. Then, a lightweight fully convolutional network is used to determine the event boundaries based on the group similarity maps. To remedy the ambiguity of boundary annotations, a Gaussian kernel is adopted to preprocess the ground-truth event boundaries, which further boosts accuracy. Extensive experiments conducted on the challenging Kinetics-GEBD and TAPOS datasets demonstrate the effectiveness of the proposed method compared with state-of-the-art methods.
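To illustrate the Gaussian preprocessing of ground-truth boundaries mentioned above, the following is a minimal sketch of how hard 0/1 boundary annotations can be converted into soft targets by placing a Gaussian bump at each annotated boundary frame. The function name and the kernel width sigma are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def smooth_boundary_labels(boundary_indices, num_frames, sigma=1.0):
    """Convert hard boundary annotations into soft per-frame targets.

    A Gaussian centered at each annotated boundary spreads the supervision
    over neighboring frames, reflecting the ambiguity of human annotations.
    (Sketch only; sigma and the max-combination rule are assumptions.)
    """
    labels = np.zeros(num_frames, dtype=np.float32)
    frames = np.arange(num_frames, dtype=np.float32)
    for b in boundary_indices:
        # Keep the strongest response when Gaussians of nearby boundaries overlap.
        labels = np.maximum(labels, np.exp(-0.5 * ((frames - b) / sigma) ** 2))
    return labels

# Example: a 20-frame clip with annotated boundaries at frames 5 and 13.
soft_targets = smooth_boundary_labels([5, 13], num_frames=20, sigma=1.0)
```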