Human activity recognition (HAR) using drone-mounted cameras has attracted considerable interest from the computer vision research community in recent years. A robust and efficient HAR system plays a pivotal role in fields such as video surveillance, crowd behavior analysis, sports analysis, and human-computer interaction. The task is challenging because of complex poses, varying viewpoints, and the environmental conditions in which the action takes place. To address these complexities, we propose a novel Sparse Weighted Temporal Attention (SWTA) module that uses sparsely sampled video frames to obtain global weighted temporal attention. The proposed SWTA comprises two parts: first, a temporal segment network that sparsely samples a given set of frames; second, weighted temporal attention, which fuses attention maps derived from optical flow with raw RGB images. This is followed by a basenet network, which comprises a convolutional neural network (CNN) module along with fully connected layers that perform activity recognition. The SWTA network can be used as a plug-in module for existing deep CNN architectures, enabling them to learn temporal information without a separate temporal stream. It has been evaluated on three publicly available benchmark datasets, namely Okutama, MOD20, and Drone-Action. The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets, surpassing the previous state-of-the-art results by margins of 25.26%, 18.56%, and 2.94%, respectively.
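The two parts of the module described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`sparse_sample_indices`, `flow_attention`, `fuse_rgb_with_attention`), the use of normalized flow magnitude as the attention map, and the linear blending rule with a scalar `weight` are all assumptions made for illustration, since the abstract does not specify the exact fusion. The optical flow field is taken as a given input rather than computed.

```python
import numpy as np


def sparse_sample_indices(num_frames, num_segments, rng=None):
    """Pick one frame index per equal-length segment (temporal-segment style).

    With rng=None the middle frame of each segment is returned (deterministic);
    otherwise one index is drawn uniformly at random inside each segment.
    """
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    # Segment boundaries: num_segments + 1 edges over [0, num_frames].
    edges = np.linspace(0, num_frames, num_segments + 1).astype(int)
    starts, ends = edges[:-1], edges[1:]
    if rng is None:
        return (starts + ends - 1) // 2
    return rng.integers(starts, ends)  # high bound is exclusive


def flow_attention(flow, eps=1e-8):
    """Assumed attention map: per-pixel flow magnitude scaled to [0, 1].

    flow has shape (H, W, 2) holding horizontal and vertical displacement.
    """
    magnitude = np.linalg.norm(flow, axis=-1)
    return magnitude / (magnitude.max() + eps)


def fuse_rgb_with_attention(rgb, attention, weight=0.5):
    """Assumed fusion rule: scale each pixel by a blend of 1 and its attention.

    weight=0 leaves the RGB frame unchanged; weight=1 keeps only the
    regions where motion was observed.
    """
    scale = (1.0 - weight) + weight * attention[..., None]
    return rgb * scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    indices = sparse_sample_indices(num_frames=300, num_segments=8)
    rgb = rng.random((4, 4, 3))          # stand-in for one sampled RGB frame
    flow = rng.normal(size=(4, 4, 2))    # stand-in for its optical flow field
    fused = fuse_rgb_with_attention(rgb, flow_attention(flow))
    print(indices, fused.shape)
```

In this sketch the fused frames, one per segment, would be what a base CNN consumes, which is how a single stream can carry motion cues without a separate temporal branch.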