Weakly supervised video anomaly detection (WS-VAD) is a challenging problem that aims to learn VAD models using only video-level annotations. In this work, we propose a Long-Short Temporal Co-teaching (LSTC) method to address the WS-VAD problem. It constructs two tubelet-based spatio-temporal transformer networks that learn from short- and long-term video clips, respectively. Each network is trained with a multiple instance learning (MIL)-based ranking loss, together with a cross-entropy loss when clip-level pseudo labels are available. A co-teaching strategy is adopted to train the two networks: the clip-level pseudo labels generated by each network are used to supervise the other network in the next training round, and the two networks are trained alternately and iteratively. Our proposed method is better able to handle anomalies of varying duration as well as subtle anomalies. Extensive experiments on three public datasets demonstrate that our method outperforms state-of-the-art WS-VAD methods.
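The training objective and alternating schedule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact loss formulas, so the hinge form of the MIL ranking loss (top-scoring clip of an anomalous video versus top-scoring clip of a normal video), the margin, the 0.5 pseudo-label threshold, the loss weight, and the choice of which network is trained first are all assumptions, and every function name is hypothetical. The sketch also ignores how pseudo labels would be mapped between short and long clips.

```python
import math


def mil_ranking_loss(anomalous_scores, normal_scores, margin=1.0):
    """Hinge ranking loss between the top-scoring clips of two bags.

    Assumed form: the highest clip score in an anomalous video should
    exceed the highest clip score in a normal video by `margin`.
    """
    return max(0.0, margin - max(anomalous_scores) + max(normal_scores))


def bce_loss(scores, pseudo_labels, eps=1e-7):
    """Mean binary cross-entropy of clip scores against pseudo labels."""
    total = 0.0
    for s, y in zip(scores, pseudo_labels):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return total / len(scores)


def make_pseudo_labels(scores, video_is_anomalous, threshold=0.5):
    """Turn clip scores into 0/1 pseudo labels (assumed thresholding rule).

    Clips of a video labeled normal are always 0, because the
    video-level annotation already rules out anomalies.
    """
    if not video_is_anomalous:
        return [0] * len(scores)
    return [1 if s >= threshold else 0 for s in scores]


def total_loss(anomalous_scores, normal_scores, pseudo_labels=None, weight=1.0):
    """Ranking loss, plus cross-entropy once pseudo labels are available."""
    loss = mil_ranking_loss(anomalous_scores, normal_scores)
    if pseudo_labels is not None:
        loss += weight * bce_loss(anomalous_scores, pseudo_labels)
    return loss


def co_teaching_schedule(num_rounds):
    """List (round, network trained, network supplying pseudo labels).

    Assumption: the short-term network is trained first, with no
    pseudo labels available in round 0.
    """
    schedule = []
    for r in range(num_rounds):
        student = "short" if r % 2 == 0 else "long"
        if r == 0:
            teacher = None
        else:
            teacher = "long" if student == "short" else "short"
        schedule.append((r, student, teacher))
    return schedule
```

In this sketch, the first round relies on the ranking loss alone. From the second round on, `total_loss` receives pseudo labels produced by `make_pseudo_labels` from the other network's clip scores, which is the cross-supervision the co-teaching strategy refers to.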