Current methods in skeleton-based action recognition mainly focus on capturing long-term temporal dependencies, since skeleton sequences are typically long (>128 frames), which poses a challenge for earlier approaches. Under these conditions, short-term dependencies are rarely formally considered, yet they are critical for classifying similar actions. Most current approaches consist of interleaved spatial-only and temporal-only modules, in which direct information flow among joints in adjacent frames is hindered; such models are therefore inferior at capturing short-term motion and distinguishing similar action pairs. To address this limitation, we propose a general framework, coined STGAT, to model cross-spacetime information flow. It equips the spatial-only modules with spatial-temporal modeling for regional perception. Although STGAT is theoretically effective for spatial-temporal modeling, we propose three simple modules that reduce local spatial-temporal feature redundancy and further unlock the potential of STGAT: they (1) narrow the scope of the self-attention mechanism, (2) dynamically weight joints along the temporal dimension, and (3) separate subtle motion from static features, respectively. As a robust feature extractor, STGAT generalizes better to classifying similar actions than previous methods, as evidenced by both qualitative and quantitative results. STGAT achieves state-of-the-art performance on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400. Code is released.
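As a rough illustration of module (3), one common way to separate subtle motion from static features in a skeleton sequence is temporal differencing. The decomposition and function name below are assumptions for illustration only, not the paper's actual implementation:

```python
import numpy as np

def split_motion_static(seq):
    """Hypothetical sketch: split a skeleton sequence of shape (T, J, C)
    (T frames, J joints, C coordinate channels) into a static component
    (per-joint temporal mean) and a motion component (frame differences)."""
    static = seq.mean(axis=0, keepdims=True)        # (1, J, C) static posture
    motion = np.diff(seq, axis=0, prepend=seq[:1])  # (T, J, C); first frame is zero
    return static, motion

# Example: 8 frames, 25 joints (NTU RGB+D layout), 3-D coordinates.
seq = np.random.rand(8, 25, 3)
static, motion = split_motion_static(seq)
print(static.shape, motion.shape)  # → (1, 25, 3) (8, 25, 3)
```

Feeding the small-magnitude motion component to the model separately from the static posture keeps subtle inter-frame movement from being swamped by the much larger absolute joint coordinates.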