Automatically detecting violence in surveillance footage is a subset of activity recognition that deserves special attention because of its wide applicability in unmanned security monitoring systems, internet video filtering, and similar settings. In this work, we propose an efficient two-stream deep learning architecture that leverages Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet, where one stream takes background-suppressed frames as input and the other stream processes the differences of adjacent frames. We employ simple and fast input pre-processing techniques that highlight the moving objects in the frames by suppressing non-moving backgrounds and that capture the motion between frames. Because violent actions are mostly characterized by body movements, these inputs help produce discriminative features. SepConvLSTM is constructed by replacing the convolution operation at each gate of ConvLSTM with a depthwise separable convolution, which enables it to produce robust long-range spatio-temporal features while using substantially fewer parameters. We experimented with three fusion methods to combine the output feature maps of the two streams. The proposed methods were evaluated on three standard public datasets. Our model improves on the state-of-the-art accuracy on the larger and more challenging RWF-2000 dataset by a margin of more than 2% while matching state-of-the-art results on the smaller datasets. Our experiments lead us to conclude that the proposed models are superior in terms of both computational efficiency and detection accuracy.
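The parameter saving from using depthwise separable convolutions at the ConvLSTM gates can be illustrated with a short count. This is a minimal sketch and not the configuration used in this work. It assumes four gates, each with one convolution over the input and one over the hidden state, one bias vector per gate, no peephole weights, and a depth multiplier of 1. The function names, kernel size, and channel counts are hypothetical values chosen only for illustration.

```python
def convlstm_params(k: int, c_in: int, c_hid: int) -> int:
    """Parameter count of a ConvLSTM cell with standard k x k convolutions.

    Assumes 4 gates, each with an input conv, a hidden-state conv,
    and one bias vector (no peephole weights).
    """
    input_conv = k * k * c_in * c_hid
    hidden_conv = k * k * c_hid * c_hid
    bias = c_hid
    return 4 * (input_conv + hidden_conv + bias)


def sepconvlstm_params(k: int, c_in: int, c_hid: int) -> int:
    """Parameter count when each gate convolution is depthwise separable.

    Each k x k convolution becomes a depthwise k x k step (one filter per
    input channel) followed by a 1 x 1 pointwise step. Assumes depth
    multiplier 1 and one bias vector per gate.
    """
    input_sep = k * k * c_in + c_in * c_hid
    hidden_sep = k * k * c_hid + c_hid * c_hid
    bias = c_hid
    return 4 * (input_sep + hidden_sep + bias)


if __name__ == "__main__":
    # Hypothetical sizes: 3 x 3 kernels, 1024 input channels, 64 hidden channels.
    k, c_in, c_hid = 3, 1024, 64
    standard = convlstm_params(k, c_in, c_hid)
    separable = sepconvlstm_params(k, c_in, c_hid)
    print(f"ConvLSTM parameters:    {standard}")
    print(f"SepConvLSTM parameters: {separable}")
    print(f"Reduction factor:       {standard / separable:.2f}")
```

Under these assumed sizes the separable variant needs 317,952 parameters against 2,507,008 for the standard cell. The gap grows with the kernel size and with the number of output channels, since the expensive k x k spatial filtering is no longer repeated for every output channel.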