Video anomaly detection in surveillance systems with only video-level labels (i.e., weakly supervised) is challenging. This is due to (i) the complex integration of human- and scene-based anomalies, comprising subtle and sharp spatio-temporal cues in real-world scenarios, and (ii) non-optimal optimization between normal and anomalous instances under weak supervision. In this paper, we propose a Human-Scene Network to learn discriminative representations by capturing both subtle and strong cues in a dissociative manner. In addition, a self-rectifying loss is proposed that dynamically computes pseudo temporal annotations from video-level labels to optimize the Human-Scene Network effectively. The proposed Human-Scene Network optimized with the self-rectifying loss is validated on three publicly available datasets, i.e., UCF-Crime, ShanghaiTech, and IITB-Corridor, outperforming recently reported state-of-the-art approaches on five out of the six scenarios considered.
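To make the idea of a self-rectifying loss concrete, the following is a minimal, hypothetical sketch (not the paper's actual formulation): per-snippet pseudo labels are recomputed from the network's current anomaly scores given only the video-level label, and a binary cross-entropy is taken against them. The function names, the top-k labelling rule, and the parameter `k` are illustrative assumptions.

```python
import math

def pseudo_temporal_labels(scores, video_label, k=2):
    """Derive per-snippet pseudo labels from a video-level label.

    Hypothetical rule: in a normal video (label 0) every snippet is
    labelled 0; in an anomalous video (label 1) the k highest-scoring
    snippets are labelled 1 and the rest 0.
    """
    if video_label == 0:
        return [0.0] * len(scores)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [1.0 if i in top else 0.0 for i in range(len(scores))]

def self_rectifying_loss(scores, video_label, k=2, eps=1e-7):
    """Binary cross-entropy between snippet anomaly scores and pseudo
    labels recomputed from those same scores (the 'self-rectifying' step,
    since the targets are refreshed as the predictions improve)."""
    targets = pseudo_temporal_labels(scores, video_label, k)
    return -sum(t * math.log(s + eps) + (1.0 - t) * math.log(1.0 - s + eps)
                for s, t in zip(scores, targets)) / len(scores)
```

In a training loop, such a loss would be evaluated per video on the snippet scores of the current epoch, so the pseudo temporal annotations tighten as the model's discrimination between normal and anomalous snippets improves.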