Weakly-Supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos with only video-level labels. Most state-of-the-art WSTAL methods follow a Multi-Instance Learning (MIL) pipeline: producing snippet-level predictions first and then aggregating them into a video-level prediction. However, we argue that existing methods overlook two important drawbacks: 1) inadequate use of motion information and 2) the incompatibility of the prevailing cross-entropy training loss. In this paper, we show that the motion cues underlying optical flow features are complementary and informative. Motivated by this, we propose to build a context-dependent motion prior, termed motionness. Specifically, a motion graph is introduced to model motionness based on a local motion carrier (e.g., optical flow). In addition, to highlight more informative video snippets, a motion-guided loss is proposed to modulate network training conditioned on the motionness scores. Extensive ablation studies confirm that motionness effectively models actions of interest, and that the motion-guided loss leads to more accurate results. Moreover, the motion-guided loss is plug-and-play and can be combined with existing WSTAL methods. Built on the standard MIL pipeline, our method achieves new state-of-the-art performance on three challenging benchmarks: THUMOS'14, ActivityNet v1.2, and ActivityNet v1.3.
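To make the motion-guided modulation concrete, below is a minimal sketch of how a motionness prior could re-weight snippet contributions inside a standard MIL pipeline. The abstract does not give the exact formulation, so the multiplicative weighting, the top-k pooling ratio, and all function and tensor names here (e.g., `motion_guided_mil_loss`, `k_ratio`) are our illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def motion_guided_mil_loss(snippet_logits, motionness, video_labels, k_ratio=0.125):
    """Hypothetical sketch of a motionness-modulated MIL loss.

    snippet_logits: (B, T, C) per-snippet class logits
    motionness:     (B, T) motion prior scores in [0, 1]
    video_labels:   (B, C) multi-hot video-level labels
    """
    # Re-weight snippet logits by motionness so motion-salient snippets
    # dominate the video-level aggregation (assumed modulation scheme).
    weighted = snippet_logits * motionness.unsqueeze(-1)

    # Top-k temporal pooling: average the k highest-scoring snippets
    # per class into a video-level prediction (standard MIL practice).
    k = max(1, int(weighted.shape[1] * k_ratio))
    topk, _ = weighted.topk(k, dim=1)
    video_logits = topk.mean(dim=1)  # (B, C)

    # Video-level multi-label classification loss.
    return F.binary_cross_entropy_with_logits(video_logits, video_labels)
```

Because the modulation enters only through the per-snippet weighting before aggregation, a prior like this can be dropped into existing MIL-based WSTAL losses without changing the rest of the pipeline, which is consistent with the plug-and-play claim above.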