In recent years, self-supervised skeleton-based action recognition with contrastive learning has achieved remarkable results. The semantic distinctions between human actions are often carried by local body parts, such as the legs or hands, so local features are valuable for skeleton-based action recognition. This paper proposes SkeAttnCLR, an attention-based contrastive learning framework for skeleton representation learning that integrates local similarity with global features. A multi-head attention mask module learns soft attention mask features from the skeletons, suppressing non-salient local features while accentuating salient ones, thereby bringing similar local features closer together in the feature space. In addition, ample contrastive pairs are generated by expanding the pairs built from salient and non-salient features with global features, which guides the network to learn semantic representations of the entire skeleton. With the attention mask mechanism, SkeAttnCLR thus learns local features under different data augmentation views. Experimental results demonstrate that including local feature similarity significantly enhances skeleton-based action representation. SkeAttnCLR outperforms state-of-the-art methods on the NTURGB+D, NTU120-RGB+D, and PKU-MMD datasets.
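The masking and pairing idea can be illustrated with a minimal sketch. It is not the SkeAttnCLR implementation: a single sigmoid gate per body part stands in for the multi-head attention mask module, the loss is a generic InfoNCE contrastive loss, and the names `split_salient`, `global_feature`, and `info_nce`, the toy features, and the temperature value are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Map an attention logit to a soft mask value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def global_feature(features):
    """Average the per-part feature vectors into one global vector."""
    n, dim = len(features), len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def split_salient(features, scores):
    """Split per-part features into salient and non-salient pooled vectors.

    features: one feature vector per body part (hypothetical toy input).
    scores:   one attention logit per body part (assumed to be given here;
              in a real model they would be produced by a learned module).
    """
    masks = [sigmoid(s) for s in scores]
    n, dim = len(features), len(features[0])
    salient = [sum(m * f[d] for m, f in zip(masks, features)) / n
               for d in range(dim)]
    non_salient = [sum((1.0 - m) * f[d] for m, f in zip(masks, features)) / n
                   for d in range(dim)]
    return salient, non_salient


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(query, positive, negatives, temperature=0.07):
    """Generic InfoNCE loss: pull query toward positive, away from negatives."""
    pos = math.exp(cosine(query, positive) / temperature)
    neg = sum(math.exp(cosine(query, k) / temperature) for k in negatives)
    return -math.log(pos / (pos + neg))


# Two augmented views of the same toy skeleton with three parts each.
view_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.8]]
scores = [2.0, -2.0, 0.0]  # assumed logits: the first part is most salient

sal_a, non_a = split_salient(view_a, scores)
sal_b, non_b = split_salient(view_b, scores)

# One plausible pairing: salient features of the two views form a positive
# pair, and the non-salient feature of the other view acts as a negative.
local_loss = info_nce(sal_a, sal_b, [non_b])
# A global positive pair is contrasted alongside the local one.
global_loss = info_nce(global_feature(view_a), global_feature(view_b), [non_b])
```

Because the soft mask and its complement sum to one, the salient and non-salient vectors in this sketch add up to the global mean feature, which makes the relationship between local and global views explicit.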