Multi-frame human pose estimation in complicated situations is challenging. Although state-of-the-art human joint detectors have demonstrated remarkable results on static images, their performance falls short when these models are applied to video sequences. Prevalent shortcomings include the failure to handle motion blur, video defocus, or pose occlusions, arising from the inability to capture temporal dependencies among video frames. On the other hand, directly employing conventional recurrent neural networks incurs empirical difficulties in modeling spatial contexts, especially when dealing with pose occlusions. In this paper, we propose a novel multi-frame human pose estimation framework that leverages abundant temporal cues between video frames to facilitate keypoint detection. Three modular components are designed in our framework: a Pose Temporal Merger encodes keypoint spatiotemporal context to generate effective search scopes, while a Pose Residual Fusion module computes weighted pose residuals in dual directions; these are then processed by our Pose Correction Network to efficiently refine the pose estimates. Our method ranks first in the Multi-frame Person Pose Estimation Challenge on the large-scale benchmark datasets PoseTrack2017 and PoseTrack2018. We have released our code, hoping to inspire future research.
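To make the three-module pipeline concrete, here is a minimal sketch of how consecutive-frame keypoint heatmaps could flow through the described stages. All function names, the fixed aggregation weights, and the toy heatmap sizes are illustrative assumptions, not the paper's actual (learned) operations.

```python
import numpy as np

def pose_temporal_merger(h_prev, h_curr, h_next):
    # Aggregate keypoint heatmaps from three consecutive frames into a
    # spatially constrained search scope around the current pose.
    # (Hypothetical fixed weighting; the paper learns this aggregation.)
    return 0.25 * h_prev + 0.5 * h_curr + 0.25 * h_next

def pose_residual_fusion(h_prev, h_curr, h_next):
    # Compute weighted pose residuals in dual temporal directions.
    res_fwd = h_curr - h_prev   # previous -> current motion cue
    res_bwd = h_curr - h_next   # next -> current motion cue
    return 0.5 * res_fwd + 0.5 * res_bwd

def pose_correction(search_scope, residuals):
    # Refine the estimate inside the search scope using the fused
    # motion residuals (stand-in for the Pose Correction Network).
    return search_scope + residuals

# Toy single-keypoint heatmaps for three consecutive frames (4x4).
rng = np.random.default_rng(0)
h_prev, h_curr, h_next = (rng.random((4, 4)) for _ in range(3))
scope = pose_temporal_merger(h_prev, h_curr, h_next)
refined = pose_correction(scope, pose_residual_fusion(h_prev, h_curr, h_next))
print(refined.shape)  # (4, 4)
```

In the actual framework each stage is a learned network operating on multi-keypoint heatmap tensors; this sketch only mirrors the data flow of merging, dual-direction residual fusion, and correction.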