Video Anomaly Detection (VAD) is an important topic in computer vision. Motivated by recent advances in self-supervised learning, this paper addresses VAD by solving an intuitive yet challenging pretext task, spatio-temporal jigsaw puzzles, which is cast as a multi-label fine-grained classification problem. Our method has several advantages over existing work: 1) the spatio-temporal jigsaw puzzles are decoupled along the spatial and temporal dimensions, which capture highly discriminative appearance and motion features, respectively; 2) full permutations are used to provide abundant jigsaw puzzles covering various difficulty levels, allowing the network to distinguish subtle spatio-temporal differences between normal and abnormal events; and 3) the pretext task is tackled end to end, without relying on any pre-trained models. Our method outperforms state-of-the-art counterparts on three public benchmarks. In particular, on ShanghaiTech Campus it surpasses reconstruction-based and prediction-based methods by a large margin.
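
To make the pretext task concrete, the following is a minimal sketch of how decoupled spatial and temporal jigsaw puzzles with full permutations might be generated. The abstract does not specify the construction, so everything here is an assumption for illustration: the function names (`shuffle_with_labels`, `make_temporal_puzzle`, `make_spatial_puzzle`, `restore`) are hypothetical, frames are plain Python lists rather than tensors, and the label convention (one original-index label per position) is only one plausible reading of "multi-label classification".

```python
import random


def shuffle_with_labels(pieces, rng):
    """Shuffle pieces with a uniformly random permutation.

    Returns (shuffled, labels), where labels[i] is the original index of the
    piece now at position i. Predicting all labels jointly gives one
    classification target per position (assumed multi-label formulation).
    """
    perm = list(range(len(pieces)))
    rng.shuffle(perm)  # samples from all n! permutations, not a fixed subset
    shuffled = [pieces[j] for j in perm]
    return shuffled, perm


def make_temporal_puzzle(frames, rng):
    """Temporal puzzle: permute the frame order, leave each frame intact."""
    return shuffle_with_labels(frames, rng)


def make_spatial_puzzle(frame, grid, rng):
    """Spatial puzzle: cut one 2D frame into grid x grid patches and permute them.

    `frame` is a list of rows; height and width are assumed divisible by grid.
    """
    patch_h = len(frame) // grid
    patch_w = len(frame[0]) // grid
    patches = [
        [row[c * patch_w:(c + 1) * patch_w]
         for row in frame[r * patch_h:(r + 1) * patch_h]]
        for r in range(grid)
        for c in range(grid)
    ]
    return shuffle_with_labels(patches, rng)


def restore(shuffled, labels):
    """Undo a puzzle given its labels (what a perfect classifier would enable)."""
    restored = [None] * len(labels)
    for position, original_index in enumerate(labels):
        restored[original_index] = shuffled[position]
    return restored


if __name__ == "__main__":
    rng = random.Random(0)
    clip = ["f0", "f1", "f2", "f3"]  # stand-ins for video frames
    shuffled_clip, temporal_labels = make_temporal_puzzle(clip, rng)
    print(shuffled_clip, temporal_labels)
    print(restore(shuffled_clip, temporal_labels))
```

Keeping the two puzzles in separate functions mirrors the decoupling described above: the temporal puzzle only disturbs frame order (motion cues), while the spatial puzzle only disturbs patch layout (appearance cues). How the network's predictions are turned into an anomaly score is not stated in the abstract and is left out of this sketch.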