A 3D Convolutional Neural Network (3D CNN) captures spatial and temporal information from 3D data such as video sequences. However, because of its convolution and pooling mechanisms, some information loss is unavoidable. To improve both the visual explanations and the classification performance of 3D CNNs, we propose two approaches: (i) aggregating layer-wise global-to-local (global-local) discrete gradients from a trained 3DResNext network, and (ii) adding an attention gating network to improve action-recognition accuracy. The proposed approach demonstrates the usefulness of every layer, termed global-local attention, in a 3D CNN via visual attribution, weakly supervised action localization, and action recognition. First, the 3DResNext network is trained for action classification, and gradients are backpropagated with respect to the maximum predicted class. The gradients and activations of every layer are then up-sampled, and aggregation produces a more nuanced attention map that highlights the most critical regions of the input videos for the predicted class. Contour thresholding of the final attention map yields the final localization. We evaluate spatial and temporal action localization in trimmed videos using fine-grained visual explanations via 3DCam. Experimental results show that the proposed approach produces informative visual explanations and discriminative attention. Furthermore, action recognition with attention gating on each layer yields better classification results than the baseline model.
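The layer-wise aggregation described above can be sketched as follows. This is a minimal, hypothetical NumPy illustration of the general Grad-CAM-style recipe the abstract outlines (gradient-weighted activations per layer, up-sampling, ReLU, normalization, and averaging across layers); it is simplified to 2D spatial maps, whereas the paper operates on 3D spatio-temporal volumes, and all function names and the nearest-neighbour up-sampling choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def layer_attention(activations, gradients):
    """Gradient-weighted attention for one layer (illustrative sketch).
    activations, gradients: (C, H, W) arrays for the predicted class."""
    weights = gradients.mean(axis=(1, 2))             # pool gradients per channel -> (C,)
    cam = np.tensordot(weights, activations, axes=1)  # weighted channel sum -> (H, W)
    return np.maximum(cam, 0)                         # ReLU: keep positive evidence only

def upsample_nearest(cam, out_h, out_w):
    """Nearest-neighbour up-sampling of a layer map to the input resolution."""
    h, w = cam.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return cam[np.ix_(rows, cols)]

def global_local_attention(layers, out_h, out_w):
    """Aggregate per-layer attention maps into one global-local map.
    layers: list of (activations, gradients) pairs, one per network layer."""
    maps = []
    for acts, grads in layers:
        cam = upsample_nearest(layer_attention(acts, grads), out_h, out_w)
        if cam.max() > 0:
            cam = cam / cam.max()                     # normalize each layer to [0, 1]
        maps.append(cam)
    return np.mean(maps, axis=0)                      # average across layers
```

For example, feeding in two layers of shapes (4, 7, 7) and (8, 14, 14) with an output size of 112×112 yields a single (112, 112) map in [0, 1], which could then be contour-thresholded for localization as the abstract describes.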