Generic Event Boundary Detection (GEBD) is a recently proposed video understanding task that aims to find semantic boundaries one level deeper than whole events. By bridging the gap between natural human perception and video understanding, it has various potential applications, including interpretable and semantically valid video parsing. Because the task is still at an early stage of development, existing GEBD solvers are simple extensions of methods for related video understanding tasks and disregard GEBD's distinctive characteristics. In this paper, we propose a novel framework for unsupervised and supervised GEBD that uses the Temporal Self-similarity Matrix (TSM) as the video representation. The new Recursive TSM Parsing (RTP) algorithm exploits local diagonal patterns in the TSM to detect boundaries, and it is combined with the Boundary Contrastive (BoCo) loss, which trains our encoder to generate more informative TSMs. Our framework applies to both unsupervised and supervised settings, and in both it achieves state-of-the-art performance by a large margin on the GEBD benchmark. Notably, our unsupervised method outperforms the previous state-of-the-art supervised model, which indicates how effective the approach is.
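To make the TSM idea concrete, the following is a minimal sketch, not the RTP algorithm or the BoCo loss described above. It assumes that per-frame features are already available as a `(T, D)` array, builds a cosine-similarity TSM, and scores each frame with a simple local contrast around the diagonal. The function names `temporal_self_similarity` and `boundary_scores`, the window size `k`, and the toy features are all hypothetical choices made for illustration.

```python
import numpy as np


def temporal_self_similarity(features):
    """Cosine-similarity TSM for a (T, D) array of per-frame features."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-8, None)  # guard against zero vectors
    return unit @ unit.T


def boundary_scores(tsm, k=2):
    """Illustrative local diagonal contrast (an assumption, not the paper's RTP).

    For each frame t, compare the similarity inside the k frames before t and
    inside the k frames from t onward with the similarity across the two groups.
    """
    num_frames = tsm.shape[0]
    scores = np.zeros(num_frames)
    for t in range(k, num_frames - k + 1):
        past = tsm[t - k:t, t - k:t].mean()
        future = tsm[t:t + k, t:t + k].mean()
        cross = tsm[t - k:t, t:t + k].mean()
        scores[t] = 0.5 * (past + future) - cross
    return scores


# Toy "video": 6 frames of one event followed by 6 frames of another.
rng = np.random.default_rng(0)
event_a = np.array([1.0, 0.0, 0.0, 0.0])
event_b = np.array([0.0, 1.0, 0.0, 0.0])
features = np.vstack([
    event_a + 0.05 * rng.normal(size=(6, 4)),
    event_b + 0.05 * rng.normal(size=(6, 4)),
])

tsm = temporal_self_similarity(features)
scores = boundary_scores(tsm)
boundary = int(np.argmax(scores))  # frame index with the strongest contrast
```

On this toy input the strongest contrast falls at frame index 6, where the second event begins. A real video would need learned features and a more careful parsing strategy than a single `argmax`.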