Generic event boundary detection (GEBD) aims to split a video into chunks at a broad and diverse set of action boundaries, matching the event boundaries that humans naturally perceive. In this study, we present a framework for localizing generic events in video that exploits the correlation between neighboring frames using pyramid feature maps in both the spatial and temporal dimensions. Features at multiple spatial dimensions of a pre-trained ResNet-50 are combined with different views along the temporal dimension to form a temporal pyramid feature map. From this map, the similarity between neighboring frames is computed and projected to build a temporal pyramid similarity feature vector. A decoder built from 1D convolutions then transforms these similarities into a new representation that incorporates their temporal relationships and is used for boundary score estimation. Extensive experiments on the GEBD benchmark dataset show the effectiveness of our system and its variants, which outperform state-of-the-art approaches. Additional experiments on the TAPOS dataset, which contains long-form videos of Olympic sport actions, further demonstrate the effectiveness of our approach compared to other methods.
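To make the pipeline above concrete, the following is a minimal, illustrative sketch rather than the implementation used in this study. It assumes per-frame feature vectors have already been extracted (the ResNet-50 backbone and its spatial pyramid are omitted), approximates the temporal pyramid with cosine similarities at a few hypothetical frame offsets, and replaces the learned 1D-convolution decoder with a fixed smoothing kernel. All function names, the offsets, the kernel, and the threshold are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pyramid_similarity(features, offsets=(1, 2, 4)):
    """For each frame t, compare frame t-k with frame t+k for every offset k.

    Indices are clamped at the video ends. The offsets stand in for the
    temporal pyramid levels and are an assumption, not values from the paper.
    """
    n = len(features)
    rows = []
    for t in range(n):
        row = []
        for k in offsets:
            left = features[max(t - k, 0)]
            right = features[min(t + k, n - 1)]
            row.append(cosine(left, right))
        rows.append(row)
    return rows


def conv1d_same(signal, kernel):
    """1D convolution with edge replication, output length equals input length."""
    half = len(kernel) // 2
    n = len(signal)
    out = []
    for t in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(t + j - half, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out


def boundary_scores(features, offsets=(1, 2, 4), kernel=(0.25, 0.5, 0.25)):
    """Turn pyramid similarities into a per-frame boundary score.

    A fixed smoothing kernel is used here in place of a trained decoder.
    """
    sims = pyramid_similarity(features, offsets)
    dissim = [1.0 - sum(row) / len(row) for row in sims]
    return conv1d_same(dissim, kernel)


def pick_boundaries(scores, threshold=0.5):
    """Return indices of local maxima whose score reaches the threshold."""
    return [
        t
        for t in range(1, len(scores) - 1)
        if scores[t] >= threshold
        and scores[t] >= scores[t - 1]
        and scores[t] > scores[t + 1]
    ]
```

For a toy sequence whose features change abruptly halfway through, `pick_boundaries(boundary_scores(features))` would be expected to report a single index next to the change point. In a trained system, the fixed kernel would be replaced by learned convolution weights and the threshold tuned on validation data.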