Human action recognition is a challenging task in computer vision. Current action recognition methods rely on computationally expensive models to learn the spatio-temporal dependencies of an action. Examples of such complex models include models that process RGB channels and optical flow separately, models that use two-stream fusion, and models that combine a convolutional neural network (CNN) with a long short-term memory (LSTM) network. Fine-tuning such models is computationally expensive as well. This paper proposes a deep neural network architecture for learning these dependencies that consists of a 3D convolutional layer, fully connected (FC) layers, and an attention layer. The architecture is simpler to implement than the models above and gives competitive performance on the UCF-101 dataset. The proposed method first learns spatial and temporal features of actions through the 3D-CNN, and the attention mechanism then helps the model focus on the features that are essential for recognition.
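The pipeline described above (3D convolution, then attention over the resulting features, then FC layers) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify layer sizes, the exact ordering of the FC and attention layers, or the form of the attention, so the kernel shape, the dot-product scoring vector `w_score`, the hidden width, and all function names here are assumptions chosen for illustration. The only value taken from the source is the 101 output classes of UCF-101.

```python
import numpy as np

def conv3d(clip, kernels):
    """Naive 'valid' 3D convolution (cross-correlation, stride 1).
    clip: (C, T, H, W); kernels: (K, C, kt, kh, kw) -> (K, T', H', W')."""
    K, C, kt, kh, kw = kernels.shape
    _, T, H, W = clip.shape
    out = np.zeros((K, T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[1]):
        for i in range(out.shape[2]):
            for j in range(out.shape[3]):
                patch = clip[:, t:t + kt, i:i + kh, j:j + kw]
                # Contract channel, time, height, and width axes at once.
                out[:, t, i, j] = np.tensordot(kernels, patch, axes=4)
    return out

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(features, w_score):
    """Assumed attention form: one scalar score per spatio-temporal position.
    features: (N, K); w_score: (K,) -> pooled (K,), weights (N,)."""
    weights = softmax(features @ w_score)
    return weights @ features, weights

def init_params(rng, in_channels=3, n_kernels=8, hidden=16, n_classes=101):
    # All sizes except n_classes (UCF-101 has 101 classes) are hypothetical.
    scale = 0.1
    return {
        "kernels": scale * rng.standard_normal((n_kernels, in_channels, 3, 3, 3)),
        "w_score": scale * rng.standard_normal(n_kernels),
        "W1": scale * rng.standard_normal((hidden, n_kernels)),
        "b1": np.zeros(hidden),
        "W2": scale * rng.standard_normal((n_classes, hidden)),
        "b2": np.zeros(n_classes),
    }

def forward(clip, params):
    # 1) Spatio-temporal features from a single 3D convolutional layer + ReLU.
    fmap = np.maximum(conv3d(clip, params["kernels"]), 0.0)
    # 2) Flatten positions: each row is the feature vector at one (t, h, w).
    features = fmap.reshape(fmap.shape[0], -1).T
    # 3) Attention weights decide which positions matter for recognition.
    pooled, weights = attention_pool(features, params["w_score"])
    # 4) Fully connected layers map the pooled feature to class probabilities.
    hidden = np.maximum(params["W1"] @ pooled + params["b1"], 0.0)
    logits = params["W2"] @ hidden + params["b2"]
    return softmax(logits), weights

rng = np.random.default_rng(0)
params = init_params(rng)
clip = rng.standard_normal((3, 6, 10, 10))  # toy clip: RGB, 6 frames, 10x10
probs, weights = forward(clip, params)
```

Here `probs` is a distribution over the 101 action classes and `weights` holds one attention weight per spatio-temporal position of the feature map. The weights are untrained, so the outputs only demonstrate shapes and data flow; a practical version would use a deep learning framework with learned parameters and a much larger input resolution.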