Weakly supervised temporal action localization (WTAL) aims to localize actions in untrimmed videos using only weak supervision (e.g., video-level labels). Most existing models process all input videos at a fixed temporal scale. However, such models are insensitive to actions whose pace differs from the ``normal'' speed, especially slow-motion action instances, which complete their movements much more slowly than their normal-speed counterparts. This gives rise to the slow-motion blurred issue: it is hard to extract salient slow-motion information from videos processed at ``normal'' speed. In this paper, we propose a novel framework termed Slow Motion Enhanced Network (SMEN) that improves a WTAL network by compensating for its low sensitivity to slow-motion action segments. The proposed SMEN comprises a Mining module and a Localization module. The Mining module generates a mask to mine slow-motion-related features by exploiting the relationships between normal motion and slow motion, while the Localization module leverages the mined slow-motion features as complementary information to improve the temporal action localization results. Our proposed framework can be easily adopted by existing WTAL networks, enabling them to be more sensitive to slow-motion actions. Extensive experiments on three benchmarks demonstrate the high performance of our proposed framework.
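The mask-then-fuse idea behind the two modules can be illustrated with a small sketch. This is not the SMEN implementation: the abstract does not specify how the mask is computed or how the features are combined, so the similarity-based mask, the function names (`slow_motion_mask`, `fuse_scores`), and the parameters (`rate`, `threshold`, `alpha`) are all assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def slow_motion_mask(features, rate=1, threshold=0.9):
    """Hypothetical mining step (an assumption, not the paper's method).

    A snippet is flagged as slow-motion-like when its feature barely changes
    compared with the snippet `rate` steps away, i.e. the content evolves
    slowly at the normal temporal scale. Requires len(features) > rate.
    """
    num_snippets = len(features)
    mask = []
    for t in range(num_snippets):
        # Compare with the snippet `rate` steps ahead, or behind at the end.
        neighbor = t + rate if t + rate < num_snippets else t - rate
        similar = cosine(features[t], features[neighbor]) >= threshold
        mask.append(1.0 if similar else 0.0)
    return mask


def fuse_scores(normal_scores, slow_scores, mask, alpha=0.5):
    """Hypothetical localization step: add masked slow-motion scores
    as complementary evidence to the normal-speed scores."""
    return [n + alpha * m * s
            for n, s, m in zip(normal_scores, slow_scores, mask)]


# Toy snippet features: the first three snippets change slowly, the rest quickly.
features = [[1, 0], [1, 0], [1, 0], [0, 1], [1, 0], [0, 1]]
mask = slow_motion_mask(features, rate=1, threshold=0.9)
fused = fuse_scores([0.2] * 6, [0.6] * 6, mask, alpha=0.5)
```

In this toy setup only the snippets flagged by the mask receive a boost from the slow-motion branch; the remaining snippets keep their normal-speed scores unchanged.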