Detecting violent incidents in surveillance footage has a significant impact on law enforcement and city safety. Although modern (smart) cameras are widely available and affordable, in most instances such technology is ineffective on its own. Furthermore, personnel monitoring CCTV recordings frequently react late, which can result in harm to people and damage to property. Automated violence detection that enables swift action is therefore crucial. The proposed solution uses a novel end-to-end deep learning-based video vision transformer (ViViT) that can proficiently discern fights, hostile movements, and violent events in video sequences. The study uses a data augmentation strategy to overcome the drawback of the weaker inductive bias of vision transformers when they are trained on smaller datasets. The detection results can subsequently be sent to the relevant local authorities, and the captured video can be analyzed. Compared with state-of-the-art (SOTA) approaches, the proposed method achieves promising performance on several challenging benchmark datasets.
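The abstract does not state which augmentations the study applies, so the following is only a minimal sketch of what clip-level augmentation for a video transformer might look like. It assumes two common choices, a random temporal crop and a random horizontal flip applied identically to every frame; the function name `augment_clip` and the nested-list clip layout (frames, rows, pixels) are hypothetical and chosen for illustration only.

```python
import random


def augment_clip(clip, num_frames, rng):
    """Return an augmented copy of `clip` with exactly `num_frames` frames.

    `clip` is a list of frames; each frame is a list of rows; each row is a
    list of pixel values. This layout is an assumption made for the sketch,
    not something specified by the paper.
    """
    total = len(clip)
    if total < num_frames:
        raise ValueError("clip is shorter than the requested number of frames")

    # Random temporal crop: choose one contiguous window of frames.
    start = rng.randrange(total - num_frames + 1)
    window = clip[start:start + num_frames]

    # Random horizontal flip: the same decision is used for every frame,
    # so motion direction stays consistent across the whole clip.
    if rng.random() < 0.5:
        window = [[row[::-1] for row in frame] for frame in window]

    return window


# Example: a toy clip of 10 frames, each 2 rows by 3 pixels.
toy_clip = [[[f * 10 + c for c in range(3)] for _ in range(2)] for f in range(10)]
augmented = augment_clip(toy_clip, num_frames=8, rng=random.Random(0))
```

Applying the flip decision once per clip rather than once per frame matters for action recognition, since per-frame flips would destroy the temporal coherence that the model is meant to learn from.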