We present a novel unsupervised deep learning framework for anomalous event detection in complex video scenes. While most existing works rely on hand-crafted appearance and motion features, we propose Appearance and Motion DeepNet (AMDN), which uses deep neural networks to learn feature representations automatically. To exploit the complementary information in appearance and motion patterns, we introduce a novel double-fusion framework that combines the benefits of traditional early-fusion and late-fusion strategies. Specifically, stacked denoising autoencoders separately learn appearance and motion features as well as a joint representation (early fusion). Based on the learned representations, multiple one-class SVM models predict the anomaly score of each input, and these scores are then integrated with a late-fusion strategy for final anomaly detection. We evaluate the proposed method on two publicly available video surveillance datasets, showing competitive performance with respect to state-of-the-art approaches.
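The scoring pipeline described above (per-representation one-class SVMs whose anomaly scores are combined by late fusion) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature arrays stand in for representations learned by the stacked denoising autoencoders, and the equal fusion weights are a hypothetical choice.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-ins for the learned representations; in AMDN these would
# come from stacked denoising autoencoders: appearance, motion, and the
# joint (early-fusion) representation.
names = ("appearance", "motion", "joint")
train_feats = {n: rng.normal(size=(200, 16)) for n in names}
test_feats = {n: rng.normal(size=(50, 16)) for n in names}

# One one-class SVM per representation, fit on (assumed normal) training data.
models = {n: OneClassSVM(nu=0.1, gamma="scale").fit(X)
          for n, X in train_feats.items()}

# Late fusion: combine the per-model anomaly scores with fixed weights
# (equal weights here are an assumption for illustration only).
weights = {"appearance": 1 / 3, "motion": 1 / 3, "joint": 1 / 3}

# Negate decision_function so that higher fused score = more anomalous.
fused = sum(w * -models[n].decision_function(test_feats[n])
            for n, w in weights.items())
```

Thresholding `fused` then yields the final anomaly decisions; the paper's actual feature learning and fusion weighting differ from this toy setup.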