The rapid growth of surveillance camera networks calls for scalable AI solutions that can efficiently analyze the large volume of video these networks produce. Video violence detection, a typical analysis task for surveillance footage, has recently received considerable attention. Most research has focused on improving supervised methods, with little, if any, attention paid to semi-supervised learning approaches. This study introduces a reinforcement learning model that can outperform existing models through a semi-supervised approach. The main novelty of the proposed method is a semi-supervised hard attention mechanism. Hard attention identifies the essential regions of a video and separates them from the non-informative parts of the data. Removing redundant data and processing the useful visual information at a higher resolution improves the model's accuracy. Because the hard attention mechanism is implemented with semi-supervised reinforcement learning, it needs no attention annotations, which makes it readily applicable to existing video violence datasets. The proposed model uses a pre-trained I3D backbone to accelerate and stabilize training. It achieved state-of-the-art accuracy of 90.4% and 98.7% on the RWF and Hockey datasets, respectively.
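To make the idea of hard attention trained by reinforcement learning more concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the abstract does not specify the policy, the reward, or the training algorithm, so everything below is an assumption chosen for illustration. A categorical policy picks one cell of a 2 x 2 grid, the clip is cropped to that cell, and a REINFORCE update with a running-mean baseline adjusts the policy. The hypothetical `toy_reward` function stands in for "the downstream classifier (an I3D backbone in the paper) was correct on the cropped region"; all function names and constants are illustrative.

```python
import math
import random

GRID = 2      # assumed 2 x 2 grid of candidate regions
SIZE = 2      # each region is SIZE x SIZE pixels
TARGET = 3    # bottom-right cell holds the only signal in the toy clip


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def crop(video, top, left, size):
    """Hard attention: keep one size x size window from every frame.

    `video` is a list of frames; each frame is a list of pixel rows.
    """
    return [[row[left:left + size] for row in frame[top:top + size]]
            for frame in video]


def sample_cell(logits, rng):
    """Sample a grid cell index from the categorical policy."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs


def reinforce_step(logits, action, probs, reward, baseline, lr=0.5):
    """One REINFORCE update on the logits.

    The gradient of log pi(action) w.r.t. the logits is onehot(action) - probs.
    """
    advantage = reward - baseline
    return [v + lr * advantage * ((1.0 if i == action else 0.0) - p)
            for i, (v, p) in enumerate(zip(logits, probs))]


def make_toy_video(frames=2):
    """All-zero clip with a single nonzero pixel in the bottom-right corner."""
    side = GRID * SIZE
    video = [[[0] * side for _ in range(side)] for _ in range(frames)]
    for frame in video:
        frame[side - 1][side - 1] = 1
    return video


def toy_reward(patch):
    """Assumed stand-in for classifier correctness: 1.0 if the crop has signal."""
    has_signal = any(px for frame in patch for row in frame for px in row)
    return 1.0 if has_signal else 0.0


def train(steps=300, seed=0):
    """Train the toy policy and return its final cell probabilities."""
    rng = random.Random(seed)
    video = make_toy_video()
    logits = [0.0] * (GRID * GRID)
    baseline = 0.0
    for _ in range(steps):
        action, probs = sample_cell(logits, rng)
        row, col = divmod(action, GRID)
        patch = crop(video, row * SIZE, col * SIZE, SIZE)
        reward = toy_reward(patch)
        logits = reinforce_step(logits, action, probs, reward, baseline)
        baseline = 0.9 * baseline + 0.1 * reward  # running-mean baseline
    return softmax(logits)


if __name__ == "__main__":
    print(train())
```

The point of the sketch is that the policy only ever receives a scalar reward, never a label saying where to look, which mirrors the claim that no attention annotations are needed. In a realistic setting the crop would be taken from the full-resolution clip and the reward would come from a trained video classifier rather than a pixel check.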