Estimating 3D human pose from a single image suffers from severe ambiguity, since multiple 3D joint configurations may have the same 2D projection. State-of-the-art methods often rely on context modeling, such as the pictorial structure model (PSM) or a graph neural network (GNN), to reduce this ambiguity. However, no study has rigorously compared the two side by side. We therefore first present a general formulation of context modeling of which both PSM and GNN are special cases. Comparing the two methods, we find that the end-to-end training scheme of GNN and the limb length constraints of PSM are two complementary factors that improve results. To combine their advantages, we propose ContextPose, an approach based on an attention mechanism that enforces soft limb length constraints in a deep network. The approach effectively reduces the chance of producing absurd 3D pose estimates with incorrect limb lengths and achieves state-of-the-art results on two benchmark datasets. More importantly, introducing limb length constraints into deep networks enables the approach to generalize substantially better.
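The projection ambiguity and the effect of a soft limb length constraint can be illustrated with a minimal Python sketch. This is not the ContextPose implementation: the pinhole camera, the Gaussian weighting, the function names, and all numeric values (focal length, expected forearm length, `sigma`) are assumptions chosen only for illustration.

```python
import math

def project(joint, focal=1000.0):
    """Pinhole projection of a 3D joint (x, y, z) onto the image plane.
    The focal length is a hypothetical value."""
    x, y, z = joint
    return (focal * x / z, focal * y / z)

def limb_length(joint_a, joint_b):
    """Euclidean distance between two 3D joints."""
    return math.dist(joint_a, joint_b)

def soft_limb_weight(joint_a, joint_b, expected_length, sigma=20.0):
    """Gaussian weight in [0, 1]: 1.0 when the limb has the expected length,
    decaying smoothly as the length deviates. An assumed form of a 'soft'
    constraint, not the weighting used in the paper."""
    deviation = limb_length(joint_a, joint_b) - expected_length
    return math.exp(-0.5 * (deviation / sigma) ** 2)

# Coordinates in millimetres, camera at the origin looking along +z.
elbow = (0.0, 0.0, 3000.0)
wrist_near = (250.0, 0.0, 3000.0)  # plausible forearm
wrist_far = (500.0, 0.0, 6000.0)   # same camera ray, twice as far away

# Both wrist candidates land on the same pixel, so 2D evidence alone
# cannot tell them apart.
same_pixel = all(
    math.isclose(a, b) for a, b in zip(project(wrist_near), project(wrist_far))
)

# A soft limb length prior (assumed forearm length: 250 mm) separates them.
weight_near = soft_limb_weight(elbow, wrist_near, expected_length=250.0)
weight_far = soft_limb_weight(elbow, wrist_far, expected_length=250.0)
```

In this toy setting the two wrist candidates are indistinguishable in the image, yet the candidate that implies an implausible forearm length receives a weight near zero. Because the weight is a smooth function of the joint positions, a constraint of this kind can in principle be used inside a network trained end to end.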