Temporal action localization (TAL) in videos is a challenging task, especially due to the large temporal-scale variation of actions. Short actions usually constitute the largest proportion of the data, yet they yield the lowest performance with all current methods. In this paper, we confront the challenge of short actions and propose a multi-level cross-scale solution dubbed video self-stitching graph network (VSGN). VSGN has two key components: video self-stitching (VSS) and cross-scale graph pyramid network (xGPN). In VSS, we focus on a short period of a video and magnify it along the temporal dimension to obtain a larger scale. Through self-stitching, we place the original clip and its magnified counterpart in one input sequence, taking advantage of the complementary properties of both scales. The xGPN component further exploits cross-scale correlations through a pyramid of cross-scale graph networks, each containing a hybrid temporal-graph module that aggregates features both across scales and within the same scale. VSGN not only enhances feature representations, but also generates more positive anchors for short actions and more short training samples. Experiments demonstrate that VSGN markedly improves the localization performance of short actions and achieves state-of-the-art overall performance on ActivityNet-v1.3, reaching an average mAP of 35.07%.
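To make the VSS idea concrete, the following minimal PyTorch sketch magnifies a clip's feature sequence along the temporal dimension and stitches it to the original to form a single input sequence. The function name `self_stitch`, the linear-interpolation upsampling, and the plain concatenation are assumptions made for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def self_stitch(clip_feats: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Illustrative sketch of video self-stitching (VSS).

    clip_feats: (C, T) feature sequence of a short video clip.
    Returns one stitched sequence holding the original clip followed by
    its temporally magnified counterpart, so a downstream network sees
    both scales of the same action at once.
    """
    c, t = clip_feats.shape
    # Magnify along the temporal dimension by linear interpolation
    # (an assumed upsampling choice for this sketch).
    magnified = F.interpolate(
        clip_feats.unsqueeze(0),   # (1, C, T)
        size=t * scale,            # (1, C, scale*T)
        mode="linear",
        align_corners=False,
    ).squeeze(0)
    # Stitch the original and magnified clips into one input sequence.
    return torch.cat([clip_feats, magnified], dim=1)  # (C, T + scale*T)

# Usage: stitch a 100-snippet clip at 2x magnification -> 300 snippets.
feats = torch.randn(256, 100)
stitched = self_stitch(feats, scale=2)
```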
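Similarly, a hybrid temporal-graph module can be sketched as a temporal-convolution branch that aggregates features within one scale, combined with a k-nearest-neighbour graph branch that lets each node aggregate neighbours drawn from either scale of the stitched sequence. The class name, layer sizes, kNN edge rule, and residual fusion below are assumptions in the spirit of xGPN, not the actual module design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleGraphBlock(nn.Module):
    """Sketch of a hybrid temporal-graph module: a temporal-conv branch
    (within-scale aggregation) plus a graph branch over kNN edges that
    may connect nodes across the two scales of a stitched sequence."""

    def __init__(self, dim: int, k: int = 4):
        super().__init__()
        self.k = k
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.graph = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, C) snippet features from the stitched sequence.
        xn = F.normalize(x, dim=1)
        sim = xn @ xn.t()                                  # cosine affinity
        idx = sim.topk(self.k + 1, dim=1).indices[:, 1:]   # kNN, drop self
        neigh = x[idx].mean(dim=1)                         # (T, C) aggregate
        g = self.graph(torch.cat([x, neigh], dim=1))       # graph branch
        t = self.temporal(x.t().unsqueeze(0)).squeeze(0).t()  # temporal branch
        return torch.relu(x + g + t)                       # residual fusion

# Usage: refine a 300-snippet stitched sequence of 256-d features.
nodes = torch.randn(300, 256)
refined = CrossScaleGraphBlock(dim=256)(nodes)
```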