In this work, we propose a new solution to 3D human pose estimation in videos. Instead of directly regressing the 3D joint locations, we draw inspiration from human skeleton anatomy and decompose the task into bone direction prediction and bone length prediction, from which the 3D joint locations can be completely derived. Our motivation is the fact that the bone lengths of a human skeleton remain constant over time. This prompts us to develop effective techniques that utilize global information across all the frames in a video for high-accuracy bone length prediction. Moreover, for the bone direction prediction network, we propose a fully-convolutional propagating architecture with long skip connections. Essentially, it predicts the directions of different bones hierarchically without using any time-consuming memory units (e.g., LSTM). A novel joint shift loss is further introduced to bridge the training of the bone length and bone direction prediction networks. Finally, we employ an implicit attention mechanism to feed the 2D keypoint visibility scores into the model as extra guidance, which significantly mitigates the depth ambiguity in many challenging poses. Our full model outperforms the previous best results on the Human3.6M and MPI-INF-3DHP datasets, and comprehensive evaluations validate its effectiveness.
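To make the decomposition concrete, each 3D joint location can be recovered recursively along the kinematic tree from the root joint. As a minimal sketch (the notation here is ours, introduced for illustration rather than taken from the abstract): writing $p(k)$ for the parent of joint $k$, $\ell_k$ for the predicted length of the bone connecting them, and $\hat{\mathbf{d}}_k$ for the predicted unit direction of that bone,

\[
\mathbf{J}_k = \mathbf{J}_{p(k)} + \ell_k \, \hat{\mathbf{d}}_k ,
\]

so the predicted bone lengths and directions, together with the root joint position, completely determine the 3D pose.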