Temporal alignment of fine-grained human actions in videos is important for numerous applications in computer vision, robotics, and mixed reality. State-of-the-art methods directly learn an image-based embedding space by leveraging powerful deep convolutional neural networks. While straightforward, their results are far from satisfactory: without additional post-processing steps, the aligned videos exhibit severe temporal discontinuity. Recent advances in in-the-wild human body and hand pose estimation promise new ways of addressing the task of human action alignment in videos. In this work, building on off-the-shelf human pose estimators, we propose a novel context-aware self-supervised learning architecture to align sequences of actions. We name it CASA. Specifically, CASA employs self-attention and cross-attention mechanisms to incorporate the spatial and temporal context of human actions, which resolves the temporal discontinuity problem. Moreover, we introduce a self-supervised learning scheme empowered by novel 4D augmentation techniques for 3D skeleton representations. We systematically evaluate the key components of our method. Our experiments on three public datasets demonstrate that CASA significantly improves phase progress and Kendall's Tau scores over previous state-of-the-art methods.
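The self- and cross-attention mechanisms mentioned above can be illustrated with a minimal scaled dot-product attention sketch. This is not the paper's actual CASA architecture; the sequence length, embedding dimension, and variable names below are hypothetical, and the pose embeddings are random placeholders standing in for per-frame skeleton features.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Hypothetical per-frame pose embeddings for two action sequences.
T, d = 8, 16                       # frames, embedding dimension (illustrative)
rng = np.random.default_rng(0)
x = rng.normal(size=(T, d))        # query sequence to be aligned
y = rng.normal(size=(T, d))        # reference sequence

self_ctx = attention(x, x, x)      # self-attention: temporal context within x
cross_ctx = attention(x, y, y)     # cross-attention: context from y, for aligning x to y

print(self_ctx.shape, cross_ctx.shape)  # (8, 16) (8, 16)
```

Self-attention lets each frame aggregate context from the rest of its own sequence, while cross-attention relates frames across the two sequences; both produce one contextualized embedding per input frame.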