Video-based human pose estimation (VHPE) is a vital yet challenging task. Although deep learning methods have made significant progress on VHPE, most approaches model the long-range interaction between joints only implicitly, by enlarging the receptive field of the convolution. In contrast, we design a lightweight, plug-and-play joint relation extractor (JRE) that models the associative relationship between joints explicitly and automatically. The JRE takes the pseudo heatmaps of joints as input and calculates the similarity between them. In this way, the JRE flexibly learns the relationship between any two joints, which allows it to capture the rich spatial configuration of human poses. Moreover, the JRE can infer invisible joints from their relationships to other joints, which helps the model locate occluded joints. Combining the JRE with temporal semantic continuity modeling, we propose a Relation-based Pose Semantics Transfer Network (RPSTN) for video-based human pose estimation. Specifically, to capture the temporal dynamics of poses, a joint relation guided pose semantics propagator (JRPSP) transfers the pose semantic information of the current frame to the next frame. The proposed model can thus transfer pose semantic features from a non-occluded frame to an occluded frame, which makes our method robust to occlusion. Furthermore, the proposed JRE module is also suitable for image-based human pose estimation. The proposed RPSTN achieves state-of-the-art results on the video-based Penn Action, Sub-JHMDB, and PoseTrack2018 datasets. Moreover, the proposed JRE improves the performance of backbones on the image-based COCO2017 dataset. Code is available at https://github.com/YHDang/pose-estimation.
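To make the idea of "calculating the similarity between pseudo heatmaps" concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state which similarity measure the JRE uses, so cosine similarity is an assumption here, and the function names (`gaussian_heatmap`, `cosine_similarity`, `joint_relation_matrix`) are hypothetical. The synthetic Gaussian heatmaps stand in for the pseudo heatmaps a backbone would produce.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Synthetic stand-in for one joint's pseudo heatmap (illustrative only)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity of two heatmaps after flattening them to vectors.

    Assumption: the paper's similarity measure may differ from this one.
    """
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    dot = sum(p * q for p, q in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(p * p for p in flat_a))
    norm_b = math.sqrt(sum(q * q for q in flat_b))
    # eps guards against division by zero for an all-zero heatmap.
    return dot / max(norm_a * norm_b, eps)


def joint_relation_matrix(heatmaps):
    """Return a K x K matrix whose (i, j) entry relates joint i to joint j."""
    k = len(heatmaps)
    return [
        [cosine_similarity(heatmaps[i], heatmaps[j]) for j in range(k)]
        for i in range(k)
    ]


if __name__ == "__main__":
    # Three hypothetical joints: two close together, one far away.
    heatmaps = [
        gaussian_heatmap(48, 48, 10, 10),
        gaussian_heatmap(48, 48, 12, 10),
        gaussian_heatmap(48, 48, 30, 30),
    ]
    for row in joint_relation_matrix(heatmaps):
        print(" ".join(f"{value:.3f}" for value in row))
```

In this sketch, every pair of joints receives a relation score regardless of how far apart the joints are in the image, which is the sense in which the relationship is modeled explicitly rather than through a growing receptive field. A learned module would presumably replace the fixed cosine measure with trainable parameters.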