Weakly supervised video anomaly detection (WS-VAD) aims to distinguish anomalies from normal events based on discriminative representations. Most existing works are limited by insufficient video representations. In this work, we develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations with only video-level annotations. In particular, MIST is composed of 1) a multiple instance pseudo label generator, which adopts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder that aims to automatically focus on anomalous regions in frames while extracting task-specific representations. Moreover, we adopt a self-training scheme to optimize both components and finally obtain a task-specific feature encoder. Extensive experiments on two public datasets demonstrate the efficacy of our method: it performs comparably to or even better than existing supervised and weakly supervised methods, notably achieving a frame-level AUC of 94.83% on ShanghaiTech.
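To make the pseudo-labeling idea concrete, the sketch below illustrates one plausible reading of a sparse continuous sampling strategy: per-clip anomaly scores are split into a few sparse segments, and within each segment the most salient run of consecutive clips is kept as that segment's representative for pseudo labeling. The function name, parameters, and exact windowing rule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sparse_continuous_sample(scores, num_segments=4, win=2):
    """Toy sketch (not the paper's code) of sparse continuous sampling:
    split the clip-score sequence into `num_segments` sparse segments and,
    within each, keep the run of `win` consecutive clips whose mean score
    is highest, so pseudo labels come from coherent temporal windows."""
    scores = np.asarray(scores, dtype=float)
    T = len(scores)
    bounds = np.linspace(0, T, num_segments + 1).astype(int)
    picks = []
    for s, e in zip(bounds[:-1], bounds[1:]):
        seg = scores[s:e]
        w = min(win, len(seg))
        # Mean score over every window of w consecutive clips in this segment.
        means = np.convolve(seg, np.ones(w) / w, mode="valid")
        start = int(np.argmax(means))
        picks.append((int(s) + start, int(s) + start + w))  # (begin, end) clip indices
    return picks

# Example: 8 clips, anomalous activity concentrated around clips 2-4.
scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.2, 0.1, 0.1]
print(sparse_continuous_sample(scores, num_segments=2, win=2))
# → [(2, 4), (4, 6)]
```

Selecting consecutive clips (rather than isolated top-scoring clips) reflects the intuition that anomalous events span contiguous stretches of video, which tends to make the resulting clip-level pseudo labels less noisy.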