Skeleton-based action recognition remains a core challenge in human-centred scene understanding due to the multiple granularities and large variations of human motion. Existing approaches typically employ a single neural representation for different motion patterns, which struggles to capture fine-grained action classes given limited training data. To address these problems, we propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification that jointly models coarse- and fine-grained skeleton motion patterns. To this end, we develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two spatio-temporal resolutions in an effective and efficient manner. Moreover, our network utilises a cross-head communication strategy to mutually enhance the representations of both heads. We conduct extensive experiments on three large-scale datasets, namely NTU RGB+D 60, NTU RGB+D 120, and Kinetics-Skeleton, and achieve state-of-the-art performance on all benchmarks, which validates the effectiveness of our method.
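To make the dual-head idea concrete, the following is a minimal sketch, not the authors' implementation: one branch processes the skeleton sequence at full temporal resolution (fine-grained) while the other sees a temporally pooled copy (coarse-grained), and a 1x1 projection passes coarse context back to the fine head as a stand-in for the cross-head communication. The graph-convolution form (X' = A X W), all layer sizes, the placeholder adjacency `A`, the fusion by addition, and the 60-class output are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GraphConvBlock(nn.Module):
    """Spatial graph convolution (X' = A X W) followed by a temporal convolution."""

    def __init__(self, in_ch, out_ch, A, t_kernel=9):
        super().__init__()
        self.register_buffer("A", A)                       # (V, V) normalised adjacency (assumed given)
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.temporal = nn.Conv2d(out_ch, out_ch,
                                  kernel_size=(t_kernel, 1),
                                  padding=(t_kernel // 2, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                                  # x: (N, C, T, V)
        x = self.spatial(x)
        x = torch.einsum("nctv,vw->nctw", x, self.A)       # aggregate features over neighbouring joints
        return self.relu(self.temporal(x))


class DualHeadSketch(nn.Module):
    """Coarse head sees a temporally pooled sequence; fine head sees the full frame rate.
    A 1x1 projection sends coarse context to the fine head (cross-head link, illustrative)."""

    def __init__(self, in_ch, hid, A, pool=2, num_classes=60):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=(pool, 1))    # coarsen the time axis only
        self.coarse = GraphConvBlock(in_ch, hid, A)
        self.fine = GraphConvBlock(in_ch, hid, A)
        self.cross = nn.Conv2d(hid, hid, kernel_size=1)    # coarse-to-fine message
        self.up = nn.Upsample(scale_factor=(pool, 1), mode="nearest")
        self.head = nn.Linear(2 * hid, num_classes)

    def forward(self, x):                                  # x: (N, C, T, V)
        c = self.coarse(self.pool(x))                      # coarse-grained branch
        f = self.fine(x) + self.up(self.cross(c))          # fine branch enhanced by coarse context
        feat = torch.cat([c.mean(dim=(2, 3)), f.mean(dim=(2, 3))], dim=1)
        return self.head(feat)


# Toy usage: 25 joints (NTU RGB+D layout), 64 frames, 3-channel joint coordinates.
A = torch.eye(25)                                          # placeholder adjacency, not a real skeleton graph
model = DualHeadSketch(in_ch=3, hid=64, A=A)
logits = model(torch.randn(2, 3, 64, 25))                  # -> (2, 60)
```

The coarse branch trades temporal detail for a wider effective receptive field, while the fine branch keeps frame-level detail; exchanging features between the two is one plausible reading of the cross-head communication described above.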