We propose to improve on graph convolution based approaches for human shape and pose estimation from monocular input, using pixel-aligned local image features. Given a single input color image, existing graph convolutional network (GCN) based techniques for human shape and pose estimation use a single convolutional neural network (CNN) generated global image feature appended to all mesh vertices equally to initialize the GCN stage, which transforms a template T-posed mesh into the target pose. In contrast, we propose for the first time the idea of using local image features per vertex. These features are sampled from the CNN image feature maps by utilizing pixel-to-mesh correspondences generated with DensePose. Our quantitative and qualitative results on standard benchmarks show that using local features improves on global ones and leads to competitive performances with respect to the state-of-the-art.
翻译:我们建议改进基于图形的人类形状变异方法,并使用单镜输入法,使用像素匹配的本地图像特征进行估算。根据单一输入色图像,现有基于图形变异网络的人类形状技术,并使用所有网状脊椎所附的单一进化神经网络产生的全球图像特征进行估算,以同样方式启动GCN阶段,将T-pos网状模板转换成目标形状。相反,我们首次提出使用每个顶端的本地图像特征的想法。这些特征通过使用与DensePose生成的像素到网的相通信,从CNN图像特征地图中取样。我们在标准基准上的定量和定性结果显示,使用本地特征可以改善全球特征,并导致在最新技术方面有竞争力的表现。