To achieve more accurate 2D human pose estimation, we extend the successful encoder-decoder architecture of the simple baseline network (SBN) in three ways. First, to reduce the quantization error caused by the large output stride, two additional decoder modules are appended to the end of the simple baseline network so that the heatmaps are predicted at full input resolution. Second, global context blocks (GCBs) are added to the encoder and decoder modules to enrich them with global context features. Third, we propose a novel spatial-attention-based multi-scale feature collection and distribution module (SA-MFCD) that fuses multi-scale features and redistributes them to boost pose estimation. Experimental results on the MS COCO dataset show that our network markedly improves the accuracy of human pose estimation over SBN: using ResNet34 as the backbone it matches the accuracy of SBN with ResNet152, and with larger backbones it achieves even better results.
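The sketch below illustrates, under stated assumptions, two of the ideas named above: a GCNet-style global context block (GCB) and a deconvolution-based decoder stage that can be stacked to reach full output resolution. It is not the authors' implementation; the class names, the `reduction` parameter, and the choice to place the GCB after each deconvolution are illustrative assumptions.

```python
# Minimal PyTorch sketch (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalContextBlock(nn.Module):
    """GCNet-style block: attention-pooled global context, transformed and added back."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.context_mask = nn.Conv2d(channels, 1, kernel_size=1)
        hidden = max(channels // reduction, 1)
        self.transform = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x):
        n, c, h, w = x.shape
        # Attention-pool all spatial positions into one context vector per sample.
        mask = F.softmax(self.context_mask(x).view(n, 1, h * w), dim=-1)
        context = torch.bmm(x.view(n, c, h * w), mask.transpose(1, 2)).view(n, c, 1, 1)
        # Transform the context and broadcast-add it to every position.
        return x + self.transform(context)


class DecoderStage(nn.Module):
    """One stride-2 deconvolution stage, optionally followed by a GCB (assumed placement)."""

    def __init__(self, in_ch: int, out_ch: int, use_gcb: bool = True):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4,
                                         stride=2, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.gcb = GlobalContextBlock(out_ch) if use_gcb else nn.Identity()

    def forward(self, x):
        return self.gcb(F.relu(self.bn(self.deconv(x))))
```

As a usage note: SBN's original decoder uses three such stride-2 stages, leaving an output stride of 4 relative to the input; appending two more stages, as the abstract describes, would bring the heatmaps to full input resolution.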