Human Activity Recognition Using Cascaded Dual Attention CNN and Bi-Directional GRU Framework

Vision-based human activity recognition has emerged as one of the essential research areas in the video analytics domain. Over the last decade, numerous advanced deep learning algorithms have been introduced to recognize complex human actions from video streams, and they have shown impressive performance on the human activity recognition task. However, these methods focus exclusively either on model performance or on computational efficiency and robustness, which results in a biased tradeoff when they are applied to the challenging human activity recognition problem. To overcome these limitations of contemporary deep learning models, this paper presents a computationally efficient yet generic spatial-temporal cascaded framework that exploits deep discriminative spatial and temporal features for human activity recognition. For efficient representation of human actions, we propose a dual attentional convolutional neural network (CNN) architecture that leverages a unified channel-spatial attention mechanism to extract human-centric salient features from video frames. The dual channel-spatial attention layers, together with the convolutional layers, learn to attend more strongly to the spatial receptive fields that contain objects, across the feature maps. The extracted discriminative salient features are then forwarded to a stacked bi-directional gated recurrent unit (Bi-GRU) for long-term temporal modeling and recognition of human actions, using both forward and backward pass gradient learning. Extensive experiments show that the proposed framework runs up to 167 times faster, measured in frames per second, than most contemporary action recognition methods.
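
The cascade described above (channel-spatial attention over per-frame feature maps, followed by a stacked Bi-GRU over the frame sequence) can be illustrated with a minimal, standard-library-only Python sketch. The abstract gives no layer-level details, so everything here is an assumption made for illustration: the attention gates are parameter-free (a sigmoid of pooled activations) rather than learned, the GRU has no bias terms and uses random weights, and all function names (`channel_attention`, `spatial_attention`, `frame_descriptor`, `init_gru`, `gru_step`, `bi_gru`) are hypothetical. It shows only the data flow, not the paper's actual architecture.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(fmap):
    """fmap: C channels, each an H x W grid. Scale each channel by a
    sigmoid gate of its global average (assumed, parameter-free gate)."""
    gates = [sigmoid(sum(map(sum, ch)) / (len(ch) * len(ch[0]))) for ch in fmap]
    return [[[v * g for v in row] for row in ch] for ch, g in zip(fmap, gates)]

def spatial_attention(fmap):
    """Scale each spatial location by a sigmoid gate of its mean over channels."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    mask = [[sigmoid(sum(fmap[c][i][j] for c in range(C)) / C)
             for j in range(W)] for i in range(H)]
    return [[[fmap[c][i][j] * mask[i][j] for j in range(W)]
             for i in range(H)] for c in range(C)]

def frame_descriptor(fmap):
    """Channel gate, then spatial gate, then global average pooling -> C-vector."""
    att = spatial_attention(channel_attention(fmap))
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in att]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def init_gru(in_dim, hid, rng):
    """Random, untrained GRU weights (illustrative only; no bias terms)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"Wz": mat(hid, in_dim), "Uz": mat(hid, hid),
            "Wr": mat(hid, in_dim), "Ur": mat(hid, hid),
            "Wn": mat(hid, in_dim), "Un": mat(hid, hid)}

def gru_step(p, x, h):
    """One GRU update: update gate z, reset gate r, candidate state n."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    n = [math.tanh(a + b) for a, b in zip(matvec(p["Wn"], x), matvec(p["Un"], rh))]
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def bi_gru(seq, fwd, bwd, hid):
    """Run one GRU forward and one backward over seq, then concatenate
    the two hidden states at every time step."""
    h, f_states = [0.0] * hid, []
    for x in seq:
        h = gru_step(fwd, x, h)
        f_states.append(h)
    h, b_states = [0.0] * hid, []
    for x in reversed(seq):
        h = gru_step(bwd, x, h)
        b_states.append(h)
    b_states.reverse()
    return [f + b for f, b in zip(f_states, b_states)]

# Toy clip: T=5 frames, each a C=3 x 4 x 4 feature map with random activations.
rng = random.Random(0)
clip = [[[[rng.random() for _ in range(4)] for _ in range(4)]
         for _ in range(3)] for _ in range(5)]
features = [frame_descriptor(frame) for frame in clip]

# Two stacked Bi-GRU layers; the second layer consumes the 2*hid outputs of the first.
hid = 4
layer1 = bi_gru(features, init_gru(3, hid, rng), init_gru(3, hid, rng), hid)
layer2 = bi_gru(layer1, init_gru(2 * hid, hid, rng), init_gru(2 * hid, hid, rng), hid)
print(len(layer2), len(layer2[0]))  # time steps, features per step
```

In a full model, the per-step outputs of the last Bi-GRU layer would typically be fed to a classifier over action labels; that step is omitted here because the abstract does not describe it.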
