In this paper, we are interested in the bottom-up paradigm of estimating human poses from an image. We study the dense keypoint regression framework, which has previously been inferior to the keypoint detection and grouping framework. Our motivation is that accurately regressing keypoint positions requires learning representations that focus on the keypoint regions. We present a simple yet effective approach, named disentangled keypoint regression (DEKR). We adopt adaptive convolutions through a pixel-wise spatial transformer to activate the pixels in the keypoint regions and accordingly learn representations from them. We use a multi-branch structure for separate regression: each branch learns a representation with dedicated adaptive convolutions and regresses one keypoint. The resulting disentangled representations each attend to the corresponding keypoint region, and the keypoint regression is thus spatially more accurate. We empirically show that the proposed direct regression method outperforms keypoint detection and grouping methods and achieves superior bottom-up pose estimation results on two benchmark datasets, COCO and CrowdPose. The code and models are available at https://github.com/HRNet/DEKR.
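To make the two mechanisms described above concrete, the following is a minimal PyTorch sketch, not the authors' released implementation (see the repository above for that). It approximates the pixel-wise spatial transformer with an offset-based deformable convolution from torchvision, and the names `AdaptiveConv` and `DisentangledRegressionHead`, as well as the channel sizes, are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the two ideas in the abstract:
# (1) an adaptive convolution whose sampling grid is shifted per pixel by a
#     learned offset field (a stand-in for the pixel-wise spatial transformer),
# (2) a multi-branch head where each branch regresses one keypoint from its
#     own disentangled representation.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class AdaptiveConv(nn.Module):
    """3x3 convolution whose sampling locations are warped per pixel."""
    def __init__(self, channels):
        super().__init__()
        # Offset predictor: 2 (x, y) offsets for each of the 9 kernel taps.
        self.offset = nn.Conv2d(channels, 2 * 9, 3, padding=1)
        self.weight = nn.Parameter(torch.randn(channels, channels, 3, 3) * 0.01)

    def forward(self, x):
        # Warp the sampling grid with the predicted offsets so the
        # convolution can attend to the relevant keypoint region.
        return deform_conv2d(x, self.offset(x), self.weight, padding=1)

class DisentangledRegressionHead(nn.Module):
    """One branch per keypoint: adaptive conv + 2D offset regression."""
    def __init__(self, in_channels, num_keypoints, branch_channels=16):
        super().__init__()
        # Split the backbone features into one slice per keypoint.
        self.split = nn.Conv2d(in_channels, num_keypoints * branch_channels, 1)
        self.branches = nn.ModuleList(
            nn.Sequential(AdaptiveConv(branch_channels),
                          nn.ReLU(inplace=True),
                          nn.Conv2d(branch_channels, 2, 1))  # (dx, dy) map
            for _ in range(num_keypoints))
        self.k = num_keypoints

    def forward(self, feats):
        chunks = self.split(feats).chunk(self.k, dim=1)
        # Each branch regresses its keypoint's offset from every candidate
        # center pixel, using only its own disentangled representation.
        return torch.cat([b(c) for b, c in zip(self.branches, chunks)], dim=1)

head = DisentangledRegressionHead(in_channels=32, num_keypoints=17)
out = head(torch.randn(1, 32, 64, 64))   # -> (1, 34, 64, 64): 17 (dx, dy) maps
```

Because each branch sees only its own feature slice, its adaptive convolution can shift its sampling grid toward one keypoint's region without interference from the other keypoints; in the paper, this head sits on top of HRNet features.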