Human pose estimation (HPE) usually requires large-scale training data to reach high performance. However, collecting high-quality, fine-grained annotations for the human body is time-consuming. To alleviate this issue, we revisit HPE and propose a location-free framework that needs no supervision of keypoint locations. We reformulate regression-based HPE from the perspective of classification. Inspired by weakly supervised object localization based on class activation maps (CAMs), we observe that coarse keypoint locations can be obtained from part-aware CAMs, but that they are unsatisfactory because of the gap between fine-grained HPE and object-level localization. To close this gap, we propose a customized transformer framework that mines a fine-grained representation of human context and is equipped with structural relations to capture subtle differences among keypoints. Concretely, we design a Multi-scale Spatial-guided Context Encoder to fully capture the global human context while focusing on part-aware regions, and a Relation-encoded Pose Prototype Generation module to encode the structural relations. Together, these components strengthen the weak supervision that image-level category labels provide on locations. Our model achieves competitive performance on three datasets when supervised only at the category level and, importantly, achieves results comparable to fully supervised methods with only 25% of the location labels on MS-COCO and MPII.
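To make the CAM observation concrete, the following is a minimal sketch of how coarse keypoint locations can be read off part-aware CAMs. It shows only the generic CAM computation from weakly supervised localization, not the transformer framework proposed here. The function names `part_cams` and `coarse_keypoints`, the array shapes, and the toy inputs are assumptions made for illustration.

```python
import numpy as np


def part_cams(features, weights):
    """Compute one class activation map per body part.

    features: (C, H, W) feature maps from the last convolutional layer.
    weights:  (K, C) weights of a linear classifier applied after global
              average pooling, one row per part category (assumed setup).
    Returns an array of shape (K, H, W).
    """
    # CAM_k(h, w) = sum_c weights[k, c] * features[c, h, w]
    return np.einsum("kc,chw->khw", weights, features)


def coarse_keypoints(cams):
    """Take the peak of each CAM as a coarse (row, col) keypoint location."""
    k, h, w = cams.shape
    flat_peaks = cams.reshape(k, -1).argmax(axis=1)
    rows, cols = np.unravel_index(flat_peaks, (h, w))
    return np.stack([rows, cols], axis=1)


# Toy input: two channels, each firing at a single spatial position.
features = np.zeros((2, 4, 5))
features[0, 1, 2] = 1.0
features[1, 3, 4] = 1.0
weights = np.eye(2)  # part k attends only to channel k (illustrative)

cams = part_cams(features, weights)
keypoints = coarse_keypoints(cams)
print(keypoints)
```

Because the classifier sits on top of global average pooling, the spatial mean of each CAM equals that part's classification logit, which is why image-level category labels alone can shape the maps. The peaks are only as precise as the feature-map resolution, which is one reason such locations remain coarse.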