To address the challenging task of instance-aware human part parsing, a new bottom-up regime is proposed to learn category-level human semantic segmentation and multi-person pose estimation jointly, in an end-to-end manner. It is a compact, efficient and powerful framework that exploits structural information over different human granularities and eases the difficulty of person partitioning. Specifically, a dense-to-sparse projection field, which explicitly associates dense human semantics with sparse keypoints, is learnt and progressively improved over the network feature pyramid for robustness. The difficult pixel grouping problem is then cast as an easier, multi-person joint assembling task. By formulating joint association as maximum-weight bipartite matching, a differentiable solution is developed that exploits projected gradient descent and Dykstra's cyclic projection algorithm. This makes our method end-to-end trainable and allows the grouping error to be back-propagated to directly supervise multi-granularity human representation learning. This distinguishes it from current bottom-up human parsers and pose estimators, which require sophisticated post-processing or heuristic greedy algorithms. Experiments on three instance-aware human parsing datasets show that our model outperforms other bottom-up alternatives with much more efficient inference.
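The matching step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a square weight matrix, relaxes the matching to doubly stochastic matrices (rows and columns summing to one, entries non-negative), and uses hypothetical names (`dykstra`, `soft_matching`, the toy matrix `W`). The relaxed objective is maximized by projected gradient ascent, and each projection onto the constraint set is approximated by Dykstra's cyclic projection. Every operation is piecewise linear, so the same loop could in principle be unrolled inside an automatic differentiation framework, which is what would make such a solver trainable end to end.

```python
# Sketch only: relaxed max-weight bipartite matching solved with projected
# gradient ascent, using Dykstra's cyclic projection as the projection step.
# Assumptions: square weight matrix, perfect matching (equality constraints).

def project_rows(X):
    # Euclidean projection onto {X : every row sums to 1}.
    n = len(X[0])
    return [[v - (sum(row) - 1.0) / n for v in row] for row in X]

def project_cols(X):
    # Same projection applied to columns, via transposition.
    Xt = [list(col) for col in zip(*X)]
    return [list(row) for row in zip(*project_rows(Xt))]

def project_nonneg(X):
    # Euclidean projection onto the non-negative orthant.
    return [[max(v, 0.0) for v in row] for row in X]

def add(A, B, scale=1.0):
    # Element-wise A + scale * B.
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def dykstra(Y, projections, cycles=100):
    # Approximate the projection of Y onto the intersection of convex sets.
    X = [row[:] for row in Y]
    corrections = [[[0.0] * len(Y[0]) for _ in Y] for _ in projections]
    for _ in range(cycles):
        for i, proj in enumerate(projections):
            Z = add(X, corrections[i])               # re-add the stored correction
            X = proj(Z)                              # project onto the i-th set
            corrections[i] = add(Z, X, scale=-1.0)   # store the new correction
    return X

def soft_matching(W, step=0.5, iters=50):
    # Maximize sum(W * X) over doubly stochastic X.
    n = len(W)
    X = [[1.0 / n] * n for _ in range(n)]            # start at the centre of the polytope
    sets = [project_rows, project_cols, project_nonneg]
    for _ in range(iters):
        # Gradient of the linear objective with respect to X is W itself.
        X = dykstra(add(X, W, scale=step), sets)
    return X

# Toy affinities between three candidate joints (rows) and three persons (columns).
W = [[0.9, 0.1, 0.3],
     [0.2, 0.8, 0.4],
     [0.3, 0.2, 0.7]]
X = soft_matching(W)
assignment = [max(range(len(row)), key=row.__getitem__) for row in X]
print(assignment)
```

In this sketch the row-wise argmax of the relaxed solution `X` is read off as the matching. Handling unmatched joints, for example by replacing the equality constraints with inequalities, would require different projection operators and is left out here.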