Human keypoint detection from a single image is very challenging due to occlusion, blur, illumination, and scale variation. In this paper, we address this problem from three aspects: devising an efficient network structure, proposing three effective training strategies, and exploiting four useful postprocessing techniques. First, we find that context information plays an important role in reasoning about human body configuration and invisible keypoints. Inspired by this, we propose a cascaded context mixer (CCM), which efficiently integrates spatial and channel context information and progressively refines it. Second, to maximize CCM's representation capability, we develop a hard-negative person detection mining strategy and a joint-training strategy that exploits abundant unlabeled data. These strategies enable CCM to learn discriminative features from a massive number of diverse poses. Third, we present several sub-pixel refinement techniques for postprocessing keypoint predictions to improve detection accuracy. Extensive experiments on the MS COCO keypoint detection benchmark demonstrate the superiority of the proposed method over representative state-of-the-art (SOTA) methods. Our single model achieves performance comparable to that of the winner of the 2018 COCO Keypoint Detection Challenge. The final ensemble model sets a new SOTA on this benchmark.
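To make the idea of sub-pixel refinement concrete, the following is a minimal sketch of one widely used heuristic for heatmap-based keypoint decoding: take the integer peak location and shift it by a quarter pixel toward the higher of its two neighbours along each axis. This is an illustration under stated assumptions, not necessarily one of the techniques used in this paper; the function name `refine_keypoint` and the `shift` parameter are hypothetical.

```python
from typing import Tuple

import numpy as np


def refine_keypoint(heatmap: np.ndarray, shift: float = 0.25) -> Tuple[float, float]:
    """Return the (x, y) peak of a 2D heatmap with a sub-pixel nudge.

    Assumption: a generic quarter-pixel offset heuristic, used here only
    to illustrate sub-pixel refinement; it is not taken from the paper.
    """
    height, width = heatmap.shape
    # Integer location of the strongest response.
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x, y = float(col), float(row)
    # Nudge toward the stronger horizontal neighbour when both exist.
    if 0 < col < width - 1:
        x += shift * float(np.sign(heatmap[row, col + 1] - heatmap[row, col - 1]))
    # Nudge toward the stronger vertical neighbour when both exist.
    if 0 < row < height - 1:
        y += shift * float(np.sign(heatmap[row + 1, col] - heatmap[row - 1, col]))
    return x, y
```

The returned coordinates are in heatmap pixels, so they would still need to be scaled back to the input image resolution. The offset compensates for part of the quantization error introduced when a keypoint is predicted on a heatmap that is coarser than the image.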