Human-centric visual tasks have attracted increasing research attention due to their widespread applications. In this paper, we aim to learn a general human representation from massive unlabeled human images that can benefit downstream human-centric tasks to the maximum extent. We call this method SOLIDER, a Semantic cOntrollable seLf-supervIseD lEaRning framework. Unlike existing self-supervised learning methods, SOLIDER utilizes prior knowledge from human images to build pseudo semantic labels and import more semantic information into the learned representation. Meanwhile, we note that different downstream tasks always require different ratios of semantic information to appearance information. For example, human parsing requires more semantic information, while person re-identification needs more appearance information for identification purposes. A single learned representation therefore cannot fit all requirements. To solve this problem, SOLIDER introduces a conditional network with a semantic controller. After the model is trained, users can send values to the controller to produce representations with different ratios of semantic information, which can fit the different needs of downstream tasks. Finally, SOLIDER is verified on six downstream human-centric visual tasks. It outperforms the state of the art and establishes new baselines for these tasks. The code is released at https://github.com/tinyvision/SOLIDER.
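To make the controller idea concrete, the following is a minimal sketch (not the official SOLIDER implementation) of how a scalar control value might condition a backbone's pooled features at inference time. The class name `SemanticController`, the MLP gating scheme, and all dimensions are assumptions for illustration only; refer to the released code at https://github.com/tinyvision/SOLIDER for the actual architecture.

```python
import torch
import torch.nn as nn

class SemanticController(nn.Module):
    """Hypothetical controller: maps a scalar value lam in [0, 1] to
    channel-wise scale/shift parameters that modulate backbone features,
    trading off semantic vs. appearance information."""
    def __init__(self, feat_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 2 * feat_dim),  # predicts scale and shift
        )

    def forward(self, feats: torch.Tensor, lam: float) -> torch.Tensor:
        # feats: (B, C) pooled representation; lam: desired semantic ratio
        ctrl = torch.full((feats.size(0), 1), lam, device=feats.device)
        scale, shift = self.mlp(ctrl).chunk(2, dim=-1)
        return feats * (1 + scale) + shift

# Usage: each downstream task picks the ratio that suits it, e.g. a high
# value for human parsing (more semantics) and a low value for person
# re-identification (more appearance).
backbone_feats = torch.randn(4, 768)          # stand-in for backbone output
controller = SemanticController(feat_dim=768)
parsing_repr = controller(backbone_feats, lam=1.0)
reid_repr = controller(backbone_feats, lam=0.2)
```

The key design point conveyed by the abstract is that the control value is an inference-time input rather than a training hyperparameter, so one trained model can serve tasks with different semantic/appearance needs.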