To pursue accurate skeleton-based action recognition, most prior methods combine Graph Convolutional Networks (GCNs) with attention-based modules in a serial manner. However, they treat the human skeleton as a complete graph, which leaves fewer distinctions between different actions (e.g., the connection between the elbow and the head in the action ``clapping hands''). To address this, we propose a novel Contrastive GCN-Transformer Network (ConGT) that fuses the spatial and temporal modules in a parallel way. ConGT consists of two parallel streams: the Spatial-Temporal Graph Convolution stream (STG) and the Spatial-Temporal Transformer stream (STT). The STG is designed to obtain action representations that preserve the natural topology of the human skeleton, while the STT captures the global relationships among joints. Since the representations produced by these two streams have different characteristics and each carries little information about the other, we introduce a contrastive learning paradigm that, in a self-supervised manner, guides their outputs for the same sample to be as close as possible. Through contrastive learning, the two streams enrich the action features by learning from each other, maximizing the mutual information between the two types of representations. To further improve recognition accuracy, we adopt the Cyclical Focal Loss (CFL), which emphasizes confident training samples in the early epochs and increasingly focuses on hard samples during the middle epochs. Experiments on three benchmark datasets demonstrate that our model achieves state-of-the-art performance in action recognition.
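The cross-stream objective described above can be illustrated with an InfoNCE-style loss, a standard lower bound on the mutual information between two views. The sketch below is a minimal, hedged example: the embedding shapes, the temperature value, and the symmetric form of the loss are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_stream_infonce(z_stg: torch.Tensor,
                         z_stt: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style contrastive loss between the two streams.

    Pulls together the STG and STT embeddings of the SAME sample and
    pushes apart embeddings of different samples in the batch, so each
    stream learns information held by the other.

    z_stg, z_stt: (batch, dim) action representations from the two
    streams (shapes and temperature are assumptions for illustration).
    """
    z1 = F.normalize(z_stg, dim=1)
    z2 = F.normalize(z_stt, dim=1)
    logits = z1 @ z2.t() / temperature  # (batch, batch) cosine-similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetric cross-entropy: each stream's embedding must identify its
    # counterpart from the other stream among all samples in the batch.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In practice this term would be added to the recognition loss, so the two parallel streams are trained jointly rather than aligned after the fact.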
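The Cyclical Focal Loss can likewise be sketched. A common formulation upweights confident samples with a $(1+p_t)^{\gamma_{hc}}$ factor early in training and shifts toward the standard focal weighting $(1-p_t)^{\gamma_{fl}}$ for hard samples mid-training. The exponents and the linear one-cycle schedule below are assumptions chosen to match the behavior the abstract describes, not the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def cyclical_focal_loss(logits: torch.Tensor,
                        targets: torch.Tensor,
                        epoch: int,
                        num_epochs: int,
                        gamma_hc: float = 2.0,  # confident-sample exponent (assumed)
                        gamma_fl: float = 2.0   # focal (hard-sample) exponent (assumed)
                        ) -> torch.Tensor:
    """Sketch of a cyclical focal loss for classification.

    Early epochs emphasize confident samples; middle epochs emphasize
    hard samples via the focal term; the blend is cycled over training.
    """
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample CE
    # p_t: predicted probability of the true class for each sample.
    p_t = F.softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    # One-cycle linear schedule (assumed form): xi goes 1 -> 0 over the
    # first half of training, then 0 -> 1 over the second half.
    half = num_epochs / 2
    xi = 1.0 - epoch / half if epoch <= half else (epoch - half) / half
    loss_confident = (1.0 + p_t) ** gamma_hc * ce  # upweights easy/confident samples
    loss_focal = (1.0 - p_t) ** gamma_fl * ce      # upweights hard samples
    return (xi * loss_confident + (1.0 - xi) * loss_focal).mean()
```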