Unified Keypoint-based Action Recognition Framework via Structured Keypoint Pooling

This paper simultaneously addresses three limitations of conventional skeleton-based action recognition: skeleton detection and tracking errors, the limited variety of targeted actions, and the difficulty of person-wise and frame-wise action recognition. It introduces a point cloud deep-learning paradigm to action recognition and proposes a unified framework together with a novel deep neural network architecture called Structured Keypoint Pooling. The proposed method sparsely aggregates keypoint features in a cascaded manner based on prior knowledge of the data structure inherent in skeletons, such as the instance and frame to which each keypoint belongs, and thereby achieves robustness against input errors. Its less constrained, tracking-free architecture allows time-series keypoints consisting of human skeletons and nonhuman object contours to be treated efficiently as an input 3D point cloud, which extends the variety of targeted actions. Furthermore, we propose a Pooling-Switching Trick inspired by Structured Keypoint Pooling. This trick switches the pooling kernels between the training and inference phases to detect person-wise and frame-wise actions in a weakly supervised manner using only video-level action labels. It also allows our training scheme to naturally incorporate a novel data augmentation that mixes multiple point clouds extracted from different videos. In the experiments, we comprehensively verify the effectiveness of the proposed method against these limitations, and the method outperforms state-of-the-art skeleton-based action recognition and spatio-temporal action localization methods.
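The pooling ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' architecture: the function names (`grouped_max_pool`, `structured_pool`, `mix_point_clouds`), the choice of max pooling, and the `mode` argument are all assumptions made for illustration, and the learned layers that a real network would place between pooling stages are omitted. Each keypoint is a row of features tagged with the frame and the instance it belongs to; pooling first aggregates within each (frame, instance) pair, and the second-stage kernel is then switched between a video-level kernel (as might be used for training with video-level labels) and person-wise or frame-wise kernels (as might be used at inference).

```python
import numpy as np


def grouped_max_pool(features, group_ids):
    """Element-wise max over the feature rows that share a group id.

    Returns (pooled, unique_ids); pooled[i] aggregates every row whose
    group id equals unique_ids[i].
    """
    features = np.asarray(features, dtype=float)
    unique_ids, inverse = np.unique(np.asarray(group_ids), return_inverse=True)
    inverse = inverse.reshape(-1)
    pooled = np.full((len(unique_ids), features.shape[1]), -np.inf)
    # Unbuffered in-place maximum, so repeated group indices accumulate.
    np.maximum.at(pooled, inverse, features)
    return pooled, unique_ids


def structured_pool(features, frame_ids, instance_ids, mode="video"):
    """Cascaded pooling over a keypoint point cloud (illustrative only).

    Stage 1 pools keypoints within each (frame, instance) pair.
    Stage 2 uses a kernel chosen by `mode` (a hypothetical switch):
      "video"  -> one feature for the whole clip
      "person" -> one feature per instance
      "frame"  -> one feature per frame
    A real network would apply learned layers between the two stages.
    """
    frame_ids = np.asarray(frame_ids)
    instance_ids = np.asarray(instance_ids)
    n_inst = int(instance_ids.max()) + 1
    pair_key = frame_ids * n_inst + instance_ids
    local, keys = grouped_max_pool(features, pair_key)
    local_frames, local_instances = keys // n_inst, keys % n_inst

    if mode == "video":
        return local.max(axis=0, keepdims=True), np.array([0])
    if mode == "person":
        return grouped_max_pool(local, local_instances)
    if mode == "frame":
        return grouped_max_pool(local, local_frames)
    raise ValueError(f"unknown mode: {mode}")


def mix_point_clouds(cloud_a, cloud_b):
    """Merge two (features, frame_ids, instance_ids) clouds into one.

    Instance ids of the second cloud are offset so that people from
    different videos stay distinct. How labels are combined is not
    specified in the abstract and is left out here.
    """
    fa, ta, ia = (np.asarray(x) for x in cloud_a)
    fb, tb, ib = (np.asarray(x) for x in cloud_b)
    offset = int(ia.max()) + 1
    return (
        np.concatenate([fa, fb]),
        np.concatenate([ta, tb]),
        np.concatenate([ia, ib + offset]),
    )


# Six keypoints with 2-D features, spread over two frames and two instances.
features = np.array([[1, 0], [2, 1], [0, 2], [3, 1], [0, 5], [1, 1]], dtype=float)
frame_ids = np.array([0, 0, 0, 1, 1, 1])
instance_ids = np.array([0, 0, 1, 0, 1, 1])

video_feat, _ = structured_pool(features, frame_ids, instance_ids, mode="video")
person_feat, person_ids = structured_pool(features, frame_ids, instance_ids, mode="person")
frame_feat, frame_idx = structured_pool(features, frame_ids, instance_ids, mode="frame")
```

Because the input is an unordered set of tagged keypoints rather than a fixed per-person tensor, a missing or spurious keypoint only changes how many rows enter a pooling group, and merging clouds from different videos reduces to concatenation. That is the intuition this sketch is meant to convey, under the assumptions stated above.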
