We develop a novel framework for single-scene video anomaly localization that provides human-understandable reasons for the decisions the system makes. We first learn general representations of objects and their motions (using deep networks) and then use these representations to build a high-level, location-dependent model of any particular scene. This model can be used to detect anomalies in new videos of the same scene. Importantly, our approach is explainable: our high-level appearance and motion features can provide human-understandable reasons for why any part of a video is classified as normal or anomalous. We conduct experiments on standard video anomaly detection datasets (Street Scene, CUHK Avenue, ShanghaiTech, and UCSD Ped1 and Ped2) and show significant improvements over the previous state of the art.
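To make the high-level, location-dependent scene model concrete, the following is a minimal sketch of one plausible realization: a per-region exemplar model over appearance and motion attribute vectors, scored by nearest-neighbor distance, with the most deviant attribute used as the explanation. This is an illustration under stated assumptions, not the authors' implementation; the class and parameter names (`RegionExemplarModel`, `threshold`, `attribute_names`) are hypothetical.

```python
import numpy as np

# Hypothetical sketch of a location-dependent, exemplar-based scene model.
# It assumes per-region appearance/motion attribute vectors have already been
# extracted by pretrained deep networks; names are illustrative only.

class RegionExemplarModel:
    """Stores normal-video exemplars for one spatial region of the scene."""

    def __init__(self, threshold: float):
        self.exemplars: list[np.ndarray] = []
        self.threshold = threshold  # distance below which a feature is considered covered

    def fit(self, features: np.ndarray) -> None:
        """Greedily keep features from normal video that are not yet represented."""
        for f in features:
            if not self.exemplars or self._nearest_distance(f) > self.threshold:
                self.exemplars.append(f)

    def score(self, feature: np.ndarray) -> float:
        """Anomaly score: distance to the nearest stored normal exemplar."""
        return self._nearest_distance(feature)

    def explain(self, feature: np.ndarray, attribute_names: list[str]) -> str:
        """Name the attribute that deviates most from the closest normal exemplar."""
        nearest = min(self.exemplars, key=lambda e: np.linalg.norm(feature - e))
        worst = int(np.argmax(np.abs(feature - nearest)))
        return f"unusual {attribute_names[worst]} for this location"

    def _nearest_distance(self, feature: np.ndarray) -> float:
        if not self.exemplars:
            return np.inf
        return min(np.linalg.norm(feature - e) for e in self.exemplars)
```

In this sketch, a grid of such models (one per spatial region) would cover the scene, so that the same object and motion can be normal in one location and anomalous in another, and the explanation names the attribute responsible.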