Weakly supervised temporal action localization aims to detect and localize actions in untrimmed videos using only video-level labels during training. Without frame-level annotations, however, it is challenging to localize actions completely and to suppress background interference. In this paper, we present an Action Unit Memory Network (AUMN) for weakly supervised temporal action localization, which mitigates these two challenges by learning an action unit memory bank. In the proposed AUMN, two attention modules are designed to update the memory bank adaptively and to learn action-unit-specific classifiers. Furthermore, three mechanisms (diversity, homogeneity, and sparsity) are designed to guide the updating of the memory network. To the best of our knowledge, this is the first work to explicitly model action units with a memory network. Extensive experimental results on two standard benchmarks (THUMOS14 and ActivityNet) demonstrate that AUMN performs favorably against state-of-the-art methods. Specifically, the average mAP over IoU thresholds from 0.1 to 0.5 on the THUMOS14 dataset improves significantly, from 47.0% to 52.1%.
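To make the reported metric concrete, the following is a minimal sketch of temporal IoU between a predicted and a ground-truth segment, and of averaging mAP over several IoU thresholds. It is illustrative only and is not the evaluation code used in this work: the function names, the `(start, end)` segment representation, and the threshold step of 0.1 are assumptions, and the mAP values in the usage lines are placeholders rather than reported results.

```python
# Assumed threshold grid: 0.1 to 0.5 in steps of 0.1 (the step is an assumption).
IOU_THRESHOLDS = [0.1, 0.2, 0.3, 0.4, 0.5]


def temporal_iou(pred, gt):
    """Intersection over union of two 1-D segments given as (start, end)."""
    # Overlap length is clamped at zero for disjoint segments.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def average_map(map_per_threshold):
    """Mean of per-threshold mAP values, given a dict {threshold: mAP}."""
    return sum(map_per_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)


# Hypothetical usage: a prediction of 0-10 s against a ground truth of 5-15 s.
iou = temporal_iou((0.0, 10.0), (5.0, 15.0))
# The prediction counts as a match only at thresholds not exceeding its IoU.
matched_at = [t for t in IOU_THRESHOLDS if iou >= t]

# Placeholder per-threshold mAP values, not results from the paper.
placeholder_map = {0.1: 0.6, 0.2: 0.5, 0.3: 0.4, 0.4: 0.3, 0.5: 0.2}
avg = average_map(placeholder_map)
```

A prediction that overlaps the ground truth only loosely is accepted at low thresholds and rejected at high ones, which is why a single averaged figure summarizes both coarse detection and precise boundary localization.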