Occlusions are universal disruptions constantly present in the real world. Especially for sparse representations, such as human skeletons, a few occluded points can destroy geometric and temporal continuity, critically affecting the results. Yet research on data-scarce recognition from skeleton sequences, such as one-shot action recognition, does not explicitly consider occlusions despite their everyday pervasiveness. In this work, we explicitly tackle body occlusions for Skeleton-based One-shot Action Recognition (SOAR). We mainly consider two occlusion variants: 1) random occlusions and 2) more realistic occlusions caused by diverse everyday objects, which we generate by projecting existing IKEA 3D furniture models into the camera coordinate system of the 3D skeletons. We leverage the proposed pipeline to blend out portions of skeleton sequences from three popular action recognition datasets (NTU-120, NTU-60 and Toyota Smart Home) and formalize the first benchmark for SOAR from partially occluded body poses. This is the first benchmark that considers occlusions for data-scarce action recognition. Another key property of our benchmark is the more realistic occlusions generated by everyday objects, as even in standard recognition from 3D skeletons, only randomly missing joints have been considered. We re-evaluate state-of-the-art frameworks in the light of this new task and further introduce Trans4SOAR, a new transformer-based model which leverages three data streams and a mixed attention fusion mechanism to alleviate the adverse effects caused by occlusions. While our experiments demonstrate a clear decline in accuracy with missing skeleton portions, this effect is smaller with Trans4SOAR, which outperforms other architectures on all datasets. Trans4SOAR additionally yields state-of-the-art results in standard SOAR, surpassing the best published approach by 2.85% on NTU-120.
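To make the first occlusion variant concrete, the following is a minimal sketch of random joint occlusion, not the pipeline used in this work. It assumes a skeleton sequence stored as a NumPy array of shape (frames, joints, 3), and it assumes that an occluded joint is represented by zeroing its coordinates. The function name `randomly_occlude`, the `ratio` parameter and its default value are hypothetical choices made for illustration only.

```python
import numpy as np


def randomly_occlude(skeleton, ratio=0.2, seed=None):
    """Blend out a random subset of joints in every frame.

    skeleton: array of shape (frames, joints, 3) with 3D joint coordinates.
    ratio:    assumed probability that a single joint is occluded in a frame.
    Returns the occluded copy and a boolean visibility mask (frames, joints).
    """
    rng = np.random.default_rng(seed)
    frames, joints, _ = skeleton.shape
    # A joint stays visible when its random draw is at least `ratio`.
    visible = rng.random((frames, joints)) >= ratio
    # Assumption: occluded joints are set to zero rather than removed.
    occluded = skeleton * visible[..., None]
    return occluded, visible


if __name__ == "__main__":
    # Toy sequence: 4 frames, 25 joints (the joint count is illustrative).
    sequence = np.ones((4, 25, 3))
    occluded, visible = randomly_occlude(sequence, ratio=0.2, seed=0)
    print(occluded.shape, int(visible.sum()), "joints remain visible")
```

Returning the visibility mask alongside the occluded sequence lets a downstream model distinguish an occluded joint from one that genuinely lies at the origin. The object-based variant would replace the random mask with one derived from projected furniture geometry.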