The growing number of surveillance cameras, together with rising security concerns, has made automatic detection of violent activity in surveillance footage an active research area. Modern deep learning methods achieve good accuracy in violence detection and are well suited to intelligent surveillance systems. However, these models are large and computationally expensive because they extract features inefficiently. This work presents a novel architecture for violence detection, the Two-stream Multi-dimensional Convolutional Network (2s-MDCN), which uses RGB frames and optical flow to detect violence. The proposed method extracts temporal and spatial information independently using 1D, 2D, and 3D convolutions. Although they combine multi-dimensional convolutional networks, our models are lightweight and efficient because of their reduced channel capacity, yet they still learn to extract meaningful spatial and temporal information. In addition, combining RGB frames with optical flow yields 2.2% higher accuracy than a single RGB stream. Despite their lower complexity, our models achieve a state-of-the-art accuracy of 89.7% on the largest violence detection benchmark dataset.
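To make the link between channel capacity and model size concrete, the sketch below counts the learnable parameters of standard 1D, 2D, and 3D convolution layers. It is only an illustration: the channel widths (64 and 16), the kernel size of 3, and the pairing of a 2D spatial convolution with a 1D temporal convolution are assumptions chosen for the example, not values or design details taken from 2s-MDCN.

```python
from math import prod


def conv_params(in_ch, out_ch, kernel, bias=True):
    """Learnable parameters in a standard N-D convolution layer.

    `kernel` has one entry per convolved dimension:
    (k,) for 1D, (kh, kw) for 2D, (kt, kh, kw) for 3D.
    """
    return in_ch * out_ch * prod(kernel) + (out_ch if bias else 0)


# Hypothetical channel widths, chosen only for illustration.
wide, narrow = 64, 16

# A single 3D convolution at the wider channel setting.
full_3d = conv_params(wide, wide, (3, 3, 3))

# A 2D (spatial) convolution plus a 1D (temporal) convolution, same width.
separate_2d_1d = conv_params(wide, wide, (3, 3)) + conv_params(wide, wide, (3,))

# The same 3D convolution with a quarter of the channels.
slim_3d = conv_params(narrow, narrow, (3, 3, 3))

print(full_3d, separate_2d_1d, slim_3d)
```

Because the parameter count scales with the product of input and output channels, cutting both by a factor of four shrinks the layer by roughly a factor of sixteen, which is a larger saving than changing the dimensionality of the kernel alone.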