In connected smart cities around the world, CCTV cameras play a pivotal role in public safety and security by recording unlawful activities so that authorities can take action. To make CCTV more effective in this domain, researchers and developers have built various deep neural network (DNN) architectures that either detect violence or detect weapons using bounding boxes or masks. The weapons these models detect are limited to guns, knives, and other obvious handheld weapons. To remove this limitation and detect weapons more efficiently, CCTV footage of non-weaponized violence must be distinguishable from footage of weaponized violence. Because no existing dataset is tailored to generalizable weaponized violence detection, we introduce a new dataset containing videos that depict weaponized violence, non-weaponized violence, and non-violent events. We also propose a novel data-centric method that arranges video frames into salient images while minimizing information loss, so that state-of-the-art (SOTA) image classifiers can perform inference on them directly. This simplifies the video classification task and reduces inference latency, which improves sustainability in smart cities. Our experiments show that image classifiers can efficiently detect violence and distinguish weaponized from non-weaponized violence, with performance as high as 99\% on our dataset, which is comparable to current SOTA 3D networks for action recognition and video classification.
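The idea of arranging video frames into a single image can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes frames are sampled uniformly from the clip and tiled in row-major order into a 2x3 grid, and the grid size, the sampling rule, the nested-list frame representation, and the function names `sample_indices`, `tile_frames`, and `video_to_grid` are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

# One grayscale frame as rows of pixel values (hypothetical representation;
# a real pipeline would more likely use image arrays).
Frame = List[List[int]]


def sample_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples frame indices spread evenly across the clip."""
    if num_samples <= 0 or num_frames < num_samples:
        raise ValueError("need at least num_samples frames")
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def tile_frames(frames: Sequence[Frame], rows: int, cols: int) -> Frame:
    """Arrange rows * cols equally sized frames into one image, row-major."""
    if len(frames) != rows * cols:
        raise ValueError("expected exactly rows * cols frames")
    height = len(frames[0])
    image: Frame = []
    for r in range(rows):
        row_of_frames = frames[r * cols:(r + 1) * cols]
        for y in range(height):
            # Concatenate pixel row y of every frame in this grid row.
            line: List[int] = []
            for frame in row_of_frames:
                line.extend(frame[y])
            image.append(line)
    return image


def video_to_grid(video: Sequence[Frame], rows: int = 2, cols: int = 3) -> Frame:
    """Sample rows * cols frames uniformly and tile them into one image."""
    indices = sample_indices(len(video), rows * cols)
    return tile_frames([video[i] for i in indices], rows, cols)
```

Under these assumptions, a clip of any length collapses into one fixed-size image that a standard 2D image classifier can consume in a single forward pass, which is where the latency benefit over 3D video networks would come from.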