Human parsing aims to partition humans in images or videos into multiple pixel-level semantic parts. Over the last decade, it has attracted significantly increased interest in the computer vision community and has been used in a broad range of practical applications, including security monitoring, social media, and visual special effects. Although deep learning-based human parsing solutions have made remarkable achievements, many important concepts, existing challenges, and potential research directions remain unclear. In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing. For each, we introduce the task setting, background concepts, relevant problems and applications, representative literature, and datasets. We also present quantitative performance comparisons of the reviewed methods on benchmark datasets. Additionally, to promote sustainable development of the community, we put forward a transformer-based human parsing framework that provides a high-performance baseline for follow-up research through universal, concise, and extensible solutions. Finally, we point out a set of under-investigated open issues in this field and suggest new directions for future study. We also provide a regularly updated project page to continuously track recent developments in this fast-advancing field: https://github.com/soeaver/awesome-human-parsing.