Video anomaly detection (VAD) is a significant computer vision problem. Existing deep neural network (DNN) based VAD methods mostly follow the route of frame reconstruction or frame prediction. However, these two approaches are limited because they do not sufficiently mine and learn higher-level visual features and temporal context relationships in videos. Inspired by video codec theory, we introduce a brand-new VAD paradigm to break through these limitations: first, we propose a new task of video event restoration based on keyframes. Requiring the DNN to infer multiple missing frames from video keyframes so as to restore a video event more effectively drives it to mine and learn potential higher-level visual features and comprehensive temporal context relationships in the video. To this end, we propose a novel U-shaped Swin Transformer Network with Dual Skip Connections (USTN-DSC) for video event restoration, in which a cross-attention and a temporal upsampling residual skip connection are introduced to further assist in restoring complex static and dynamic motion object features in the video. In addition, we propose a simple yet effective adjacent frame difference loss to constrain the motion consistency of the video sequence. Extensive experiments on benchmarks demonstrate that USTN-DSC outperforms most existing methods, validating the effectiveness of our method.
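
A minimal sketch of what such an adjacent frame difference loss could look like (not the authors' exact formulation; the tensor layout (B, T, C, H, W), the L1 norm, and the function name are assumptions):

import torch
import torch.nn.functional as F

def adjacent_frame_difference_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # pred, target: restored and ground-truth frame sequences of shape (B, T, C, H, W).
    # Penalize the mismatch between consecutive-frame differences of the two sequences,
    # which constrains the restored video to follow the same motion as the original.
    pred_diff = pred[:, 1:] - pred[:, :-1]        # frame-to-frame motion of restored sequence
    target_diff = target[:, 1:] - target[:, :-1]  # frame-to-frame motion of ground truth
    return F.l1_loss(pred_diff, target_diff)

In practice this term would be added to the frame restoration loss with a weighting factor; the specific norm and weight are design choices not specified in the abstract.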