Hypergraph Transformer for Skeleton-based Action Recognition

Skeleton-based action recognition aims to predict human actions from human joint coordinates and their skeletal interconnections. Transformer-based formulations would be a natural choice for modeling such off-grid data points and their co-occurrences. However, Transformers still lag behind state-of-the-art methods based on graph convolutional networks (GCNs). Transformers assume that the input is permutation-invariant and homogeneous (an assumption only partially alleviated by positional encoding), which ignores an important characteristic of skeleton data, namely bone connectivity. Furthermore, each type of body joint has a clear physical meaning in human motion, i.e., motion retains an intrinsic relationship regardless of the joint coordinates, and Transformers do not exploit this. In fact, certain recurring groups of body joints are often involved in specific actions, such as the subconscious hand movement used to keep balance. Vanilla attention cannot describe such underlying relations, which are persistent and go beyond pairwise interactions. In this work, we aim to exploit these unique aspects of skeleton data to close the performance gap between Transformers and GCNs. Specifically, we propose a new self-attention (SA) extension, named Hypergraph Self-Attention (HyperSA), to incorporate inherently higher-order relations into the model. K-hop relative positional embeddings are also employed to take bone connectivity into account. We name the resulting model Hyperformer. It achieves accuracy and efficiency comparable to or better than state-of-the-art GCN architectures on the NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA datasets. On NTU RGB+D 120, the largest of the three, the significantly improved performance of Hyperformer demonstrates the underestimated potential of Transformer models in this field.
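The two ingredients named above, hop-based relative position and joint groups acting as hyperedges, can be illustrated with a small NumPy sketch. This is not the Hyperformer implementation: the abstract gives no formulas, so the toy 5-joint skeleton, the joint-to-group assignment, the scalar per-hop bias, the mean-pooled hyperedge feature, and the function names `hop_distances` and `hyper_attention` are all assumptions made for illustration. Learned query/key/value projections are omitted for brevity.

```python
from collections import deque

import numpy as np


def hop_distances(num_joints, bones):
    """All-pairs hop counts on the skeleton graph via BFS (-1 if unreachable)."""
    adj = [[] for _ in range(num_joints)]
    for a, b in bones:
        adj[a].append(b)
        adj[b].append(a)
    dist = np.full((num_joints, num_joints), -1, dtype=np.int64)
    for s in range(num_joints):
        dist[s, s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[s, v] < 0:
                    dist[s, v] = dist[s, u] + 1
                    queue.append(v)
    return dist


def hyper_attention(x, dist, groups, hop_bias):
    """Single-head attention with a hop-distance bias and a hyperedge term.

    x:        (n, d) joint features, used directly as queries, keys and values
    dist:     (n, n) hop counts from hop_distances
    groups:   (n,) hypothetical hyperedge index of each joint (no empty groups)
    hop_bias: (K + 1,) assumed scalar bias per hop distance, clipped at K
    """
    n, d = x.shape
    K = len(hop_bias) - 1
    # Distances beyond K hops (or unreachable pairs) share the last bias entry.
    idx = np.where(dist < 0, K, np.minimum(dist, K))
    # Assumed hyperedge feature: mean of the joints belonging to each group.
    num_groups = int(groups.max()) + 1
    edge_feat = np.stack([x[groups == g].mean(axis=0) for g in range(num_groups)])
    joint_edge = edge_feat[groups]  # (n, d): feature of each joint's hyperedge
    # Pairwise joint-to-joint score plus joint-to-hyperedge score plus hop bias.
    scores = (x @ x.T + x @ joint_edge.T) / np.sqrt(d) + hop_bias[idx]
    scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights


if __name__ == "__main__":
    # Toy skeleton: joint 0 is the torso with two 2-joint limbs attached.
    bones = [(0, 1), (1, 2), (0, 3), (3, 4)]
    dist = hop_distances(5, bones)
    groups = np.array([0, 1, 1, 2, 2])    # torso, limb A, limb B (assumed)
    hop_bias = np.array([0.5, 0.2, -0.1])  # K = 2, arbitrary values
    x = np.random.default_rng(0).normal(size=(5, 8))
    out, weights = hyper_attention(x, dist, groups, hop_bias)
    print(dist)
    print(out.shape)
```

In this sketch, two joints in the same group contribute identical hyperedge terms as keys, so the group-level relation persists regardless of the individual joint coordinates, while the hop bias injects bone connectivity that plain attention would otherwise ignore.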
