Learning discriminative features that separate abnormal events from normality is crucial for weakly supervised video anomaly detection (WS-VAD). Existing approaches, whether driven by video-level or segment-level labels, mainly focus on extracting representations of anomalous data while neglecting what normal data imply. We observe that this scheme is sub-optimal: distinguishing anomalies well requires understanding what a normal state is, and ignoring normality may yield a higher false-alarm rate. To address this issue, we propose an Uncertainty Regulated Dual Memory Units (UR-DMU) model that learns both representations of normal data and discriminative features of abnormal data. Specifically, inspired by the traditional global and local structure of graph convolutional networks, we introduce a Global and Local Multi-Head Self-Attention (GL-MHSA) module into the Transformer network to obtain more expressive embeddings that capture associations within videos. We then use two memory banks, with an additional abnormal memory to handle hard samples, to store and separate abnormal and normal prototypes and to maximize the margin between the two representations. Finally, we propose an uncertainty learning scheme that learns a normal-data latent space robust to noise from camera switching, object changes, scene transitions, etc. Extensive experiments on the XD-Violence and UCF-Crime datasets demonstrate that our method outperforms state-of-the-art methods by a sizable margin.
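The dual-memory idea above can be illustrated with a minimal sketch: two learnable banks of prototype slots (normal and abnormal) are each read via scaled dot-product attention, giving every video snippet two readouts whose separation can then be supervised. All names, sizes, and the attention-based read mechanism here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_read(x, mem):
    """Attention read of a memory bank.
    x:   (T, D) snippet features for one video.
    mem: (M, D) learnable prototype slots.
    Returns the (T, D) readout: each snippet is re-expressed as a
    softmax-weighted mixture of the stored prototypes."""
    attn = softmax(x @ mem.T / np.sqrt(mem.shape[1]))
    return attn @ mem

# Hypothetical sizes: 16 snippets, 512-dim features, 60 slots per bank.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 512))
normal_mem = rng.normal(size=(60, 512))    # normal prototype bank
abnormal_mem = rng.normal(size=(60, 512))  # extra abnormal bank for hard samples

normal_read = memory_read(feats, normal_mem)
abnormal_read = memory_read(feats, abnormal_mem)
```

In training, a margin-style loss would push `normal_read` and `abnormal_read` apart so the two banks store distinct prototypes; here the sketch only shows the read path.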