Weakly-supervised temporal action localization (WTAL) in untrimmed videos has emerged as a practical but challenging task, since only video-level labels are available. Existing approaches typically leverage off-the-shelf segment-level features, which suffer from spatial incompleteness and temporal incoherence, thus limiting their performance. In this paper, we tackle this problem from a new perspective by enhancing segment-level representations with a simple yet effective graph convolutional network, namely the action complement graph network (ACGNet). It enables each video segment to perceive spatial-temporal dependencies from other segments that potentially convey complementary clues, implicitly mitigating the negative effects caused by the two issues above. In this way, the segment-level features become more discriminative and robust to spatial-temporal variations, contributing to higher localization accuracy. More importantly, the proposed ACGNet works as a universal module that can be flexibly plugged into different WTAL frameworks while preserving end-to-end training. Extensive experiments are conducted on the THUMOS'14 and ActivityNet1.2 benchmarks, where the state-of-the-art results clearly demonstrate the superiority of the proposed approach.
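To make the plug-in idea concrete, the following is a minimal sketch, assuming PyTorch, of a module that builds a segment-affinity graph and applies one graph-convolution step to enhance off-the-shelf segment features before passing them to an arbitrary WTAL head. The cosine-similarity adjacency, top-k sparsification, layer sizes, and names below are illustrative assumptions, not the authors' exact ACGNet design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegmentGraphEnhancer(nn.Module):
    """Hypothetical enhancer: refines segment features (B, T, D) with one
    graph convolution over a segment-affinity graph, then feeds them to
    any downstream WTAL framework."""

    def __init__(self, feat_dim: int, topk: int = 8):
        super().__init__()
        self.topk = topk
        self.proj = nn.Linear(feat_dim, feat_dim)  # assumed GCN weight matrix

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) off-the-shelf segment-level features
        normed = F.normalize(feats, dim=-1)
        affinity = torch.matmul(normed, normed.transpose(1, 2))  # (B, T, T)

        # keep only the top-k most similar segments per node (sparse graph)
        topk_vals, _ = affinity.topk(self.topk, dim=-1)
        mask = affinity >= topk_vals[..., -1:].expand_as(affinity)
        adj = affinity * mask
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # row-normalize

        # one graph-convolution step: aggregate complementary clues from
        # related segments, then fuse residually with the original features
        enhanced = torch.matmul(adj, self.proj(feats))
        return feats + F.relu(enhanced)


# Usage (placeholder names): plug the enhancer in front of an existing model.
# enhancer = SegmentGraphEnhancer(feat_dim=2048)
# wtal_logits = base_wtal_head(enhancer(segment_features))
```

Because the module only maps segment features to enhanced features of the same shape, it can sit in front of different WTAL frameworks without altering their losses or training loops, which mirrors the plug-and-play property claimed above.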