In connected smart cities around the world, CCTV cameras play a pivotal role in protecting the safety and security of citizens by recording unlawful activity so that authorities can take action. To make CCTV more effective in this domain, researchers and developers have created and used various DNN architectures either to detect violence or to detect weapons using bounding boxes or masks. The weapons these detectors handle are limited to guns, knives, and other obvious handheld weapons. To remove this limit and detect weapons more efficiently, CCTV footage of non-weaponized violence must be distinguishable from footage of weaponized violence. Since no current dataset is tailored to generalizable weaponized violence detection, we introduce a new dataset containing videos that depict weaponized violence, non-weaponized violence, and non-violent events. We also propose a novel data-centric method that arranges video frames into salient images while minimizing information loss, so that SOTA image classifiers can run inference on them directly. The aim is to simplify the video classification task and optimize inference latency, improving sustainability in smart cities. Our experiments show that image classifiers can efficiently detect violence and distinguish violence with weapons from violence without weapons, with performance as high as 99\% on our dataset, which is comparable to current SOTA 3D networks for action recognition and video classification.
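To make the idea of arranging video frames into a single image concrete, the following is a minimal sketch only. The abstract does not specify how frames are selected or laid out, so the even temporal sampling, the 2x3 grid, and the function name `frames_to_grid` are all assumptions made for illustration, not the authors' method.

```python
import numpy as np


def frames_to_grid(frames: np.ndarray, rows: int = 2, cols: int = 3) -> np.ndarray:
    """Tile evenly spaced frames of one clip into a single image.

    Hypothetical helper: the sampling rule and grid layout are assumptions.
    `frames` has shape (num_frames, height, width, channels).
    Returns an array of shape (rows * height, cols * width, channels).
    """
    num_frames, height, width, channels = frames.shape
    if num_frames < 1:
        raise ValueError("clip must contain at least one frame")

    # Pick rows * cols frame indices spread evenly over the clip.
    # Short clips simply repeat frames.
    indices = np.linspace(0, num_frames - 1, rows * cols).round().astype(int)
    picked = frames[indices]  # (rows * cols, H, W, C)

    # Lay the tiles out row-major: tile k goes to row k // cols, column k % cols.
    grid = picked.reshape(rows, cols, height, width, channels)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, H, cols, W, C)
    return grid.reshape(rows * height, cols * width, channels)


if __name__ == "__main__":
    # Synthetic 12-frame clip where frame i is filled with the value i.
    clip = np.stack([np.full((4, 5, 3), i, dtype=np.uint8) for i in range(12)])
    image = frames_to_grid(clip)
    print(image.shape)
```

Under these assumptions, the resulting array can be resized and passed to any 2D image classifier in place of a video model, which is the kind of simplification the abstract describes.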