Although many approaches to multi-human pose estimation in videos have shown promising results, they require densely annotated data, which entails excessive manual labor. Moreover, occlusion and motion blur inevitably degrade estimation performance. To address these problems, we propose a method that leverages an attention mask for occluded joints and encodes temporal dependencies between frames using transformers. First, our framework composes different combinations of sparsely annotated frames that trace the overall joint movement. From these combinations we derive an occlusion attention mask, which enables encoding occlusion-aware heatmaps as a semi-supervised task. Second, the proposed temporal encoder employs a transformer architecture to effectively aggregate temporal relationships and keypoint-wise attention from each time step, accurately refining the pose estimate for the target frame. We achieve state-of-the-art pose estimation results on the PoseTrack2017 and PoseTrack2018 datasets and demonstrate the robustness of our approach to occlusion and motion blur in sparsely annotated video data.
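The abstract itself contains no code; as a concrete illustration, the following is a minimal PyTorch sketch of one plausible way an occlusion attention mask could modulate a heatmap objective, treating occluded joints as weak targets. Everything here (the class name, the occ_weight parameter, the visible tensor layout) is an assumption made for illustration, not the paper's implementation.

    import torch
    import torch.nn as nn

    class OcclusionMaskedHeatmapLoss(nn.Module):
        # Masked MSE over joint heatmaps: occluded joints contribute with a
        # reduced weight (occ_weight), mimicking a soft occlusion attention mask.
        def __init__(self, occ_weight: float = 0.5):
            super().__init__()
            self.occ_weight = occ_weight

        def forward(self, pred, target, visible):
            # pred, target: (B, J, H, W) heatmaps; visible: (B, J), 1=visible, 0=occluded
            per_joint = ((pred - target) ** 2).mean(dim=(2, 3))  # (B, J)
            weights = torch.where(visible.bool(),
                                  torch.ones_like(per_joint),
                                  torch.full_like(per_joint, self.occ_weight))
            return (weights * per_joint).mean()

In this sketch, choosing occ_weight between 0 and 1 lets occluded joints act as weak, semi-supervised targets rather than being ignored outright.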
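Similarly, here is a hedged sketch of a temporal encoder in the spirit the abstract describes: a plain transformer encoder attending over per-frame keypoint tokens and refining the center (target) frame. All dimensions and names (embed_dim, max_frames, the (B, T, J, D) token layout) are illustrative assumptions, not the paper's actual architecture.

    import torch
    import torch.nn as nn

    class TemporalKeypointEncoder(nn.Module):
        # Self-attention over all joint tokens from T frames jointly captures
        # temporal relationships and keypoint-wise attention in one pass.
        def __init__(self, embed_dim=128, num_heads=4, num_layers=2, max_frames=9):
            super().__init__()
            self.time_pos = nn.Parameter(torch.zeros(max_frames, embed_dim))
            layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
            self.head = nn.Linear(embed_dim, 2)  # refined (x, y) per joint

        def forward(self, tokens):
            # tokens: (B, T, J, D) keypoint embeddings from T frames
            B, T, J, D = tokens.shape
            x = tokens + self.time_pos[:T].view(1, T, 1, D)  # add temporal position
            x = self.encoder(x.reshape(B, T * J, D))         # attend across frames and joints
            x = x.reshape(B, T, J, D)
            return self.head(x[:, T // 2])                   # refine the target frame's pose

Example usage: TemporalKeypointEncoder()(torch.randn(2, 5, 17, 128)) returns a (2, 17, 2) tensor of refined joint coordinates for the middle frame of the 5-frame window.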