Drone-camera based human activity recognition (HAR) has received significant attention from the computer vision research community in recent years. A robust and efficient HAR system plays a pivotal role in fields such as video surveillance, crowd behavior analysis, sports analysis, and human-computer interaction. The task is challenging because of complex human poses, varying viewpoints, and the diverse environmental conditions in which actions take place. To address these complexities, in this paper we propose a novel Sparse Weighted Temporal Fusion (SWTF) module that utilizes sparsely sampled video frames to obtain a globally weighted temporal fusion outcome. The proposed SWTF consists of two components: first, a temporal segment network that sparsely samples a given set of frames; second, a weighted temporal fusion that combines feature maps derived from optical flow with raw RGB images. This is followed by a base network, comprising a convolutional neural network module along with fully connected layers, that performs the activity recognition. The SWTF network can be used as a plug-in module for existing deep CNN architectures, enabling them to learn temporal information without the need for a separate temporal stream. It has been evaluated on three publicly available benchmark datasets, namely Okutama, MOD20, and Drone-Action. The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets, surpassing the previous state-of-the-art performance by a significant margin.
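For illustration, the sketch below shows one plausible PyTorch realization of the two components described above: sparse segment sampling and a weighted fusion of optical-flow and RGB features. It is a minimal sketch under stated assumptions, not the authors' implementation; the class and parameter names (SWTF, num_segments, seg_weights, flow_weight) are hypothetical.

```python
# Minimal sketch of the SWTF idea; all names are illustrative assumptions,
# not the paper's released code.
import torch
import torch.nn as nn

class SWTF(nn.Module):
    """Sparse Weighted Temporal Fusion: sparsely sample one frame per
    segment, then fuse optical-flow features with RGB features using
    learned per-segment weights."""
    def __init__(self, num_segments: int = 8, feat_dim: int = 256):
        super().__init__()
        self.num_segments = num_segments
        # Learnable per-segment weights for the global temporal fusion.
        self.seg_weights = nn.Parameter(torch.ones(num_segments))
        # Learnable balance between RGB and optical-flow feature maps.
        self.flow_weight = nn.Parameter(torch.tensor(0.5))

    def sparse_sample(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W); pick one frame per uniform segment,
        # in the spirit of a temporal segment network.
        T = frames.shape[1]
        idx = torch.linspace(0, T - 1, self.num_segments).long()
        return frames[:, idx]  # (B, num_segments, C, H, W)

    def forward(self, rgb_feats: torch.Tensor,
                flow_feats: torch.Tensor) -> torch.Tensor:
        # rgb_feats, flow_feats: (B, num_segments, feat_dim) per-segment
        # features produced by a shared CNN backbone (not shown here).
        fused = rgb_feats + self.flow_weight * flow_feats
        w = torch.softmax(self.seg_weights, dim=0)     # normalized weights
        # Weighted sum over segments -> one global clip-level feature.
        return (fused * w.view(1, -1, 1)).sum(dim=1)   # (B, feat_dim)
```

In this reading, the clip-level feature returned by `forward` would be passed to the fully connected layers of the base network for classification, which is how a single stream could absorb temporal information without a separate temporal branch.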