We propose a direct, regression-based approach to 2D human pose estimation from single images. We formulate the problem as a sequence prediction task and solve it with a Transformer network. The network directly learns a regression mapping from images to keypoint coordinates, without resorting to intermediate representations such as heatmaps, and so avoids much of the complexity associated with heatmap-based methods. To overcome the feature misalignment issues of previous regression-based methods, we propose an attention mechanism that adaptively attends to the features most relevant to the target keypoints, considerably improving accuracy. Importantly, our framework is end-to-end differentiable and naturally learns to exploit the dependencies between keypoints. Experiments on MS-COCO and MPII, two predominant pose estimation datasets, demonstrate that our method significantly improves upon the state of the art in regression-based pose estimation. More notably, ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods.
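To make the idea of attention-based coordinate regression concrete, the following is a minimal sketch in plain Python. It is not the network described above: it assumes one learned query vector per keypoint, a single scaled dot-product attention step over flattened image features, and a linear layer with a sigmoid that outputs normalized (x, y) coordinates. All names (`cross_attention`, `regress_keypoints`, `w_xy`, `b_xy`) and the random stand-in weights are hypothetical and chosen only for illustration.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_attention(query, features):
    """One query attends over all feature tokens (scaled dot-product).

    Returns the attention-weighted feature vector and the weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in features])
    dim = len(features[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return pooled, weights


def regress_keypoints(queries, features, w_xy, b_xy):
    """Map each keypoint query directly to normalized (x, y) in (0, 1).

    No heatmap is produced: coordinates come from a linear layer applied
    to the features that the query attended to.
    """
    coords = []
    for q in queries:
        pooled, _ = cross_attention(q, features)
        x = sigmoid(dot(w_xy[0], pooled) + b_xy[0])
        y = sigmoid(dot(w_xy[1], pooled) + b_xy[1])
        coords.append((x, y))
    return coords


# Hypothetical sizes and random stand-ins for learned parameters.
random.seed(0)
dim, n_tokens, n_keypoints = 8, 16, 17
features = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_keypoints)]
w_xy = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
b_xy = [0.0, 0.0]

coords = regress_keypoints(queries, features, w_xy, b_xy)
print(len(coords), "keypoints predicted as normalized (x, y) pairs")
```

Because every step here is a smooth function of the queries and weights, a sketch of this kind could be trained end to end with a coordinate loss; a full Transformer would additionally use self-attention between the queries, which is one way dependencies between keypoints can be modeled.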