Top-down human pose estimation approaches assume that each input bounding box contains a single person (instance). This assumption often leads to failures in crowded scenes with occlusions. We propose a novel solution to overcome the limitations of this fundamental assumption. Our Multi-Instance Pose Network (MIPNet) predicts multiple 2D pose instances within a given bounding box. We introduce a Multi-Instance Modulation Block (MIMB) that adaptively modulates channel-wise feature responses for each instance and is parameter-efficient. We demonstrate the efficacy of our approach by evaluating on the COCO, CrowdPose, and OCHuman datasets. Specifically, we achieve 70.0 AP on CrowdPose and 42.5 AP on OCHuman test sets, improvements of 2.4 AP and 6.5 AP over the prior art, respectively. When ground-truth bounding boxes are used for inference, MIPNet improves over HRNet by 0.7 AP on COCO, 0.9 AP on CrowdPose, and 9.1 AP on OCHuman validation sets. Interestingly, when fewer, high-confidence bounding boxes are used, HRNet's performance degrades by 5 AP on OCHuman, whereas MIPNet remains relatively stable (a drop of 1 AP) for the same inputs.
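To make the idea of per-instance channel-wise modulation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a squeeze-and-excitation style gate with one small excitation branch per instance index; the class name `InstanceModulation`, the `reduction` parameter, and the random weights are all hypothetical choices made for illustration. The shared feature map is rescaled channel by channel, and the gate depends on which instance is being predicted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class InstanceModulation:
    """Hypothetical channel-wise gate with one excitation branch per instance."""

    def __init__(self, channels, num_instances, reduction=2, seed=0):
        rng = np.random.default_rng(seed)
        hidden = max(channels // reduction, 1)
        # Assumption: each instance index owns a small pair of weight matrices,
        # while the (much larger) backbone features are shared by all instances.
        self.w1 = rng.standard_normal((num_instances, channels, hidden)) * 0.5
        self.w2 = rng.standard_normal((num_instances, hidden, channels)) * 0.5

    def __call__(self, features, instance):
        # features: array of shape (C, H, W) computed once per bounding box.
        squeezed = features.mean(axis=(1, 2))                    # global average pool, (C,)
        hidden = np.maximum(squeezed @ self.w1[instance], 0.0)   # ReLU, (hidden,)
        gate = sigmoid(hidden @ self.w2[instance])               # per-channel scale in (0, 1)
        return features * gate[:, None, None]                    # broadcast over H and W


# Usage: the same features yield a different modulated map per instance index.
block = InstanceModulation(channels=16, num_instances=2)
feats = np.random.default_rng(1).standard_normal((16, 4, 4))
out0 = block(feats, instance=0)
out1 = block(feats, instance=1)
```

In this sketch only the two small weight matrices are duplicated per instance, which is one way a block of this kind could stay parameter-efficient; how MIMB actually achieves this is not specified in the abstract.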