We propose a lightweight and accurate method for detecting anomalies in videos. Existing methods use multiple-instance learning (MIL) to determine the normal/abnormal status of each video segment. Recent successful studies argue that, to achieve high accuracy, it is important to learn the temporal relationships among segments rather than to focus on a single segment. We therefore analyzed the methods that have been successful in recent years and found that, while it is indeed important to learn all segments together, the temporal order among them is irrelevant to achieving high accuracy. Based on this finding, we do not use the MIL framework; instead, we propose a lightweight model with a self-attention mechanism that automatically extracts, from all input segments, the features that are important for determining normal/abnormal status. As a result, our neural network model has only 1.3\% of the number of parameters of the existing method. We evaluated the frame-level detection accuracy of our method on three benchmark datasets (UCF-Crime, ShanghaiTech, and XD-Violence) and show that it achieves accuracy comparable to or better than that of state-of-the-art methods.
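The claim that segments should be learned jointly while their temporal order does not matter can be illustrated with a minimal sketch. The code below is not the proposed model: the dimensions, the random weights, and the helper name `self_attention_scores` are all assumptions made for illustration. It shows only a general property of a single self-attention layer without positional encoding: every segment attends to all segments, and shuffling the input segments simply shuffles the per-segment scores in the same way.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]


def self_attention_scores(segments, Wq, Wk, Wv, w_out):
    """Return one score in (0, 1) per segment (hypothetical helper).

    No positional encoding is added, so the layer has no access to
    the temporal order of the segments.
    """
    qs = [matvec(Wq, s) for s in segments]
    ks = [matvec(Wk, s) for s in segments]
    vs = [matvec(Wv, s) for s in segments]
    scale = math.sqrt(len(qs[0]))
    scores = []
    for q in qs:
        # Each segment attends to every segment in the video.
        weights = softmax([dot(q, k) / scale for k in ks])
        context = [
            sum(w * v[j] for w, v in zip(weights, vs))
            for j in range(len(vs[0]))
        ]
        logit = dot(w_out, context)
        scores.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
    return scores


# Assumed toy sizes: 5 segments, 4-dimensional features.
rng = random.Random(0)
dim = 4


def rand_matrix():
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]


Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
w_out = [rng.uniform(-1, 1) for _ in range(dim)]
segments = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]

scores = self_attention_scores(segments, Wq, Wk, Wv, w_out)

# Shuffle the temporal order and score again.
perm = [3, 0, 4, 1, 2]
shuffled_scores = self_attention_scores(
    [segments[p] for p in perm], Wq, Wk, Wv, w_out
)

# Each segment keeps its score regardless of where it appears.
order_agnostic = all(
    abs(shuffled_scores[i] - scores[perm[i]]) < 1e-9
    for i in range(len(perm))
)
print("order-agnostic:", order_agnostic)
```

In this sketch, temporal order could only influence the scores if positional information were added to the segment features before the attention step.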