Human pose estimation relies heavily on visual clues and anatomical constraints between parts to locate keypoints. Most existing CNN-based methods do well at visual representation but lack the ability to explicitly learn the constraint relationships between keypoints. In this paper, we propose a novel approach based on Token representation for human Pose estimation~(TokenPose). Concretely, each keypoint is explicitly embedded as a token to simultaneously learn constraint relationships and appearance cues from images. Extensive experiments show that the small and large TokenPose models are on par with state-of-the-art CNN-based counterparts while being more lightweight. Specifically, our TokenPose-S and TokenPose-L achieve 72.5 AP and 75.8 AP on the COCO validation dataset, respectively, with significant reductions in parameters ($\downarrow 80.6\%$; $\downarrow 56.8\%$) and GFLOPs ($\downarrow 75.3\%$; $\downarrow 24.7\%$).
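
The idea of embedding each keypoint as a token can be illustrated with a minimal sketch. It is not the TokenPose implementation: it assumes a single attention head with identity query/key/value projections, random vectors in place of learned embeddings, and hypothetical sizes (`DIM`, `NUM_PATCHES`, `NUM_KEYPOINTS`). It only shows that once keypoint tokens and image-patch tokens share one sequence, each keypoint token attends both to patches (appearance cues) and to the other keypoint tokens (constraint relationships).

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Simplifying assumption: queries, keys and values are the tokens
    themselves (no learned projection matrices).
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Each output is the attention-weighted average of all tokens.
        outputs.append(
            [sum(w[j] * tokens[j][i] for j in range(len(tokens))) for i in range(d)]
        )
    return outputs, weights


random.seed(0)
DIM = 8            # hypothetical embedding size
NUM_PATCHES = 6    # hypothetical number of image-patch tokens
NUM_KEYPOINTS = 3  # hypothetical number of keypoint tokens

# Random stand-ins for visual patch embeddings and learnable keypoint embeddings.
patch_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
keypoint_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_KEYPOINTS)]

# Keypoint tokens and patch tokens are processed as one joint sequence.
sequence = keypoint_tokens + patch_tokens
outputs, weights = self_attention(sequence)

# Split the first keypoint token's attention into its two sources.
first_kp_weights = weights[0]
to_keypoints = sum(first_kp_weights[:NUM_KEYPOINTS])  # constraint relationships
to_patches = sum(first_kp_weights[NUM_KEYPOINTS:])    # appearance cues
print(f"attention to keypoint tokens: {to_keypoints:.3f}")
print(f"attention to patch tokens:    {to_patches:.3f}")
```

In a full model the updated keypoint tokens (`outputs[:NUM_KEYPOINTS]` here) would be passed to a prediction head; that stage is omitted because the abstract does not describe it.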