Video Anomaly Detection (VAD) has traditionally been tackled with two main methodologies: reconstruction-based and prediction-based approaches. Because reconstruction-based methods learn to reconstruct the input frame, the model can degenerate into an identity function, a shortcoming known as the generalization problem. Prediction-based methods, which learn to predict a future frame from several preceding frames, are less susceptible to this problem. However, it remains unclear whether such a model truly learns the spatio-temporal context of a video. Our intuition is that understanding the spatio-temporal context of a video plays a vital role in VAD, as it provides precise information on how the appearance of an event changes over a video clip. Hence, to fully exploit contextual information for anomaly detection in videos, we design a transformer model with three different contextual prediction streams: masked, whole, and partial. By learning to predict the missing frames within sequences of consecutive normal frames, our model can effectively learn diverse normality patterns in video, which leads to high reconstruction error on abnormal events that do not fit the learned context. To verify the effectiveness of our approach, we evaluate our model on the public benchmark datasets UCSD Pedestrian 2, CUHK Avenue, and ShanghaiTech, using the reconstruction error as the anomaly score. The results demonstrate that our proposed approach achieves competitive performance compared with existing video anomaly detection methods.
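As a minimal sketch of the scoring step described above: frame-level anomaly scores in prediction-based VAD are commonly derived from the reconstruction (prediction) error, often expressed as PSNR and min-max normalized over each clip so that poorly predicted frames receive scores near 1. The function names and the normalization convention below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between a predicted frame and the ground truth.

    Higher PSNR means the frame was predicted well (i.e., fits the learned context).
    """
    mse = float(np.mean((pred - target) ** 2))
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def anomaly_scores(psnrs: np.ndarray) -> np.ndarray:
    """Min-max normalize PSNRs over a clip and invert them.

    Frames with low PSNR (large prediction error) map to scores near 1,
    flagging events that do not match the learned normality patterns.
    """
    p_min, p_max = psnrs.min(), psnrs.max()
    return 1.0 - (psnrs - p_min) / (p_max - p_min + 1e-8)
```

A frame whose prediction error is large relative to the rest of the clip thus receives a high anomaly score, matching the abstract's claim that abnormal events yield high reconstruction error.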