Top-down human pose estimation approaches assume that each input bounding box contains a single person. This assumption often leads to failures in crowded scenes with occlusions. We propose a novel solution to overcome the limitations of this fundamental assumption. Our Multi-Hypothesis Pose Network (MHPNet) predicts multiple 2D poses within a given bounding box. We introduce a Multi-Hypothesis Attention Block (MHAB) that adaptively modulates channel-wise feature responses for each hypothesis and is parameter-efficient. We demonstrate the efficacy of our approach by evaluating on the COCO, CrowdPose, and OCHuman datasets. Specifically, we achieve 70.0 AP on the CrowdPose test set and 42.5 AP on the OCHuman test set, a significant improvement of 2.4 AP and 6.5 AP over the prior art, respectively. When using ground-truth bounding boxes for inference, MHPNet improves over HRNet by 0.7 AP on COCO, 0.9 AP on CrowdPose, and 9.1 AP on OCHuman validation sets. Interestingly, when fewer, high-confidence bounding boxes are used, HRNet's performance degrades (by 5 AP) on OCHuman, whereas MHPNet remains relatively stable (a drop of 1 AP) for the same inputs.
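The abstract does not spell out how MHAB is built, so the following is only a minimal sketch of what per-hypothesis channel-wise modulation could look like, loosely in the style of squeeze-and-excitation gating. Every name here (`make_gate_params`, `modulate`, the per-hypothesis `weight` and `bias` vectors) is a hypothetical choice for illustration, not the authors' design. The idea shown is that all hypotheses share one feature map and each hypothesis adds only a small set of gating parameters, which is one way such a block could stay parameter-efficient.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def global_average_pool(features):
    """Squeeze step: reduce each H x W channel to its mean value."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]


def make_gate_params(num_hypotheses, num_channels, seed=0):
    """Hypothetical gating parameters: 2 * C scalars per hypothesis (assumed layout)."""
    rng = random.Random(seed)
    return [
        {
            "weight": [rng.uniform(-1, 1) for _ in range(num_channels)],
            "bias": [rng.uniform(-1, 1) for _ in range(num_channels)],
        }
        for _ in range(num_hypotheses)
    ]


def modulate(features, gate_params, hypothesis):
    """Rescale each channel of the shared features with a hypothesis-specific gate."""
    pooled = global_average_pool(features)
    p = gate_params[hypothesis]
    # One gate in (0, 1) per channel, computed from that channel's pooled response.
    gates = [sigmoid(w * s + b) for w, s, b in zip(p["weight"], pooled, p["bias"])]
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, features)]


# Toy shared feature map: 2 channels of size 2 x 2 from a single bounding box.
features = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
params = make_gate_params(num_hypotheses=2, num_channels=2)

primary = modulate(features, params, hypothesis=0)    # features for the first pose
secondary = modulate(features, params, hypothesis=1)  # features for the second pose
```

In this sketch the same backbone features are reused for every hypothesis, and only the channel gates change, so a downstream pose head would see differently weighted channels for the first and second person in the box.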