Weakly supervised temporal action localization aims to learn instance-level action patterns from video-level labels, where a major challenge is action-context confusion. To overcome this challenge, a recent work builds an action-click supervision framework; it requires a similar annotation cost yet steadily improves localization performance over conventional weakly supervised methods. In this paper, by revealing that the performance bottleneck of existing approaches mainly comes from background errors, we find that a stronger action localizer can be trained with labels on background frames rather than on action frames. To this end, we convert action-click supervision to background-click supervision and develop a novel method called BackTAL. Specifically, BackTAL performs two-fold modeling of the background frames, i.e., position modeling and feature modeling. In position modeling, we not only conduct supervised learning on the annotated frames but also design a score separation module that enlarges the score gap between potential action frames and backgrounds. In feature modeling, we propose an affinity module that measures frame-specific similarities among neighboring frames and dynamically attends to informative neighbors when computing temporal convolution. Extensive experiments on three benchmarks demonstrate the strong performance of BackTAL and the rationality of the proposed background-click supervision. Code is available at https://github.com/VividLe/BackTAL.
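To make the two modules concrete, below is a minimal PyTorch sketch of the ideas described above: a score-separation loss that pushes apart the scores of likely action frames and annotated background frames, and a temporal convolution whose neighborhood is re-weighted per frame by embedding affinity. The names (`score_separation_loss`, `LocalAffinityConv1d`), the loss form, the top-k ratio, and the kernel size are illustrative assumptions, not the authors' exact implementation; see the linked repository for the official code.

```python
# Hedged sketch of the two BackTAL ideas; names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def score_separation_loss(cas, bg_mask, top_ratio=0.1):
    """Enlarge the gap between scores of potential action frames and
    scores at the annotated background frames.

    cas:     (T,) activation scores of one class for one video
    bg_mask: (T,) bool tensor, True at background-click frames (assumed non-empty)
    """
    k = min(max(1, int(cas.numel() * top_ratio)), int((~bg_mask).sum()))
    act = torch.sigmoid(cas[~bg_mask].topk(k).values.mean())  # likely action frames
    bg = torch.sigmoid(cas[bg_mask].mean())                   # annotated backgrounds
    return bg - act + 1.0  # minimized when act -> 1 and bg -> 0


class LocalAffinityConv1d(nn.Module):
    """Temporal convolution whose neighborhood is re-weighted, per frame,
    by cosine affinity between frame embeddings (the 'affinity' idea)."""

    def __init__(self, in_dim, out_dim, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.embed = nn.Conv1d(in_dim, in_dim // 4, 1)  # affinity embedding
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim * kernel_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x):  # x: (B, C, T)
        B, C, T = x.shape
        pad = self.k // 2
        # cosine affinity of each frame to its k neighbors: (B, T, K)
        emb = F.normalize(self.embed(x), dim=1)
        nbr_e = F.pad(emb, (pad, pad)).unfold(2, self.k, 1)  # (B, C', T, K)
        aff = torch.softmax((nbr_e * emb.unsqueeze(-1)).sum(1), dim=-1)
        # affinity-weighted neighbors feed a shared convolution kernel
        nbr_x = F.pad(x, (pad, pad)).unfold(2, self.k, 1)    # (B, C, T, K)
        nbr_x = nbr_x * aff.unsqueeze(1)
        nbr_x = nbr_x.permute(0, 2, 1, 3).reshape(B, T, C * self.k)
        y = nbr_x @ self.weight.t() + self.bias              # (B, T, out_dim)
        return y.transpose(1, 2)                             # (B, out_dim, T)
```

In the paper, the embedding space behind the affinities is itself supervised with the background clicks; the sketch omits that auxiliary loss and only shows how frame-specific affinities can modulate the aggregation of a standard temporal convolution.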