Generic Boundary Detection (GBD) aims to locate the general boundaries that divide videos into semantically coherent, taxonomy-free units, and can serve as an important pre-processing step for long-form video understanding. Previous works often handle the different types of generic boundaries separately, with network designs specific to each type, ranging from simple CNNs to LSTMs. In this paper, we instead present Temporal Perceiver, a general Transformer-based architecture that offers a unified solution to the detection of arbitrary generic boundaries, from shot-level and event-level to scene-level boundaries. The core design is to introduce a small set of latent feature queries as anchors that compress the redundant video input into a fixed dimension via cross-attention blocks. Because the number of latent units is fixed, the complexity of the attention operation drops from quadratic to linear in the number of input frames. Specifically, to explicitly leverage the temporal structure of videos, we construct two types of latent feature queries: boundary queries and context queries, which handle semantic incoherence and coherence, respectively. Moreover, to guide the learning of the latent feature queries, we propose an alignment loss on the cross-attention maps that explicitly encourages the boundary queries to attend to the top boundary candidates. Finally, we present a sparse detection head on the compressed representation that directly outputs the final boundary detection results without any post-processing module. We test Temporal Perceiver on a variety of GBD benchmarks. With single-stream RGB features, our method obtains state-of-the-art results on all benchmarks: SoccerNet-v2 (81.9% avg-mAP), Kinetics-GEBD (86.0% avg-F1), TAPOS (73.2% avg-F1), MovieScenes (51.9% AP and 53.1% mIoU) and MovieNet (53.3% AP and 53.2% mIoU), demonstrating the generalization ability of Temporal Perceiver.
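The compression step described in the abstract can be illustrated with a minimal sketch of single-head cross-attention in plain Python. It is not the authors' implementation: the sizes `T`, `K`, `D`, the random initialisation, and the helper names `cross_attention` and `random_matrix` are assumptions made for illustration, and the learned projections, stacked layers, alignment loss, and detection head are all omitted. It only shows why a fixed number of latent queries makes the attention cost grow linearly with the number of frames.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, features):
    """Compress T frame features into K latent vectors.

    queries:  K x D latent query vectors (hypothetical, random here)
    features: T x D per-frame features
    Returns (latents of shape K x D, attention map of shape K x T).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attn = []
    for q in queries:
        # One scaled dot-product score per (query, frame) pair: K * T in total.
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in features]
        attn.append(softmax(scores))
    # Each latent is an attention-weighted average of the frame features.
    latents = [
        [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        for weights in attn
    ]
    return latents, attn


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, K, D = 200, 8, 16  # frames, latent queries, feature width (illustrative sizes)
frames = random_matrix(T, D, rng)
boundary_queries = random_matrix(K // 2, D, rng)  # assumed even split of the K queries
context_queries = random_matrix(K // 2, D, rng)

latents, attn = cross_attention(boundary_queries + context_queries, frames)

cross_attention_scores = K * T  # grows linearly with T because K is fixed
self_attention_scores = T * T   # what frame-to-frame self-attention would need
```

With these sizes the cross-attention computes 1,600 scores where frame-to-frame self-attention would compute 40,000, and the output always has `K` rows regardless of how long the video is. The `attn` matrix (one row per query, one column per frame) is the kind of cross-attention map on which an alignment loss such as the one mentioned above would operate.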