Weakly supervised temporal action localization is a challenging vision task due to the absence of ground-truth temporal locations of actions in the training videos. With only video-level supervision during training, most existing methods rely on a Multiple Instance Learning (MIL) framework to predict the start and end frames of each action category in a video. However, existing MIL-based approaches have a major limitation: they capture only the most discriminative frames of an action, ignoring the full extent of the activity. Moreover, these methods cannot model background activity effectively, which plays an important role in localizing foreground activities. In this paper, we present a novel framework named HAM-Net with a hybrid attention mechanism that includes temporal soft, semi-soft, and hard attention to address these issues. Our temporal soft attention module, guided by an auxiliary background class in the classification module, models the background activity by introducing an "action-ness" score for each video snippet. Moreover, our temporal semi-soft and hard attention modules, computing two attention scores for each video snippet, help to focus on the less discriminative frames of an action to capture the full action boundary. Our proposed approach outperforms recent state-of-the-art methods by at least 2.2% mAP at IoU threshold 0.5 on the THUMOS14 dataset, and by at least 1.3% mAP at IoU threshold 0.75 on the ActivityNet1.2 dataset. Code can be found at: https://github.com/asrafulashiq/hamnet.
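The three attention variants described above can be illustrated with a minimal sketch. The exact formulations belong to the HAM-Net paper; the version below is only one plausible reading, in which semi-soft attention zeroes out the most discriminative (highest-scoring) snippets while keeping soft values elsewhere, and hard attention binarizes the same rule. The threshold value and the drop rule are assumptions, not the paper's definitive method.

```python
import numpy as np

def soft_attention(logits):
    # Sigmoid maps per-snippet logits to an "action-ness" score in (0, 1).
    return 1.0 / (1.0 + np.exp(-logits))

def semi_soft_attention(soft, thresh=0.5):
    # Hypothetical form: suppress the most discriminative snippets
    # (soft score above `thresh`) so the classifier must rely on the
    # less discriminative frames of the action.
    att = soft.copy()
    att[soft > thresh] = 0.0
    return att

def hard_attention(soft, thresh=0.5):
    # Hypothetical binary counterpart: dropped snippets get 0,
    # all remaining snippets get full weight 1.
    att = np.ones_like(soft)
    att[soft > thresh] = 0.0
    return att
```

For example, snippet logits of `[2.0, -1.0]` yield soft scores of roughly `[0.88, 0.27]`; semi-soft attention keeps `[0.0, 0.27]` and hard attention keeps `[0.0, 1.0]`, forcing the downstream classifier away from the single most discriminative snippet.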