Recent research on unsupervised person re-identification~(reID) has demonstrated that pre-training on unlabeled person images achieves superior performance on downstream reID tasks compared with pre-training on ImageNet. However, these pre-training methods are specifically designed for reID and do not adapt flexibly to other pedestrian analysis tasks. In this paper, we propose VAL-PAT, a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information. To train our framework, we introduce three learning objectives, \emph{i.e.,} self-supervised contrastive learning, image-text contrastive learning and multi-attribute classification. The self-supervised contrastive learning facilitates the learning of intrinsic pedestrian properties, while the image-text contrastive learning guides the model to focus on the appearance information of pedestrians. Meanwhile, multi-attribute classification encourages the model to recognize attributes so as to mine fine-grained pedestrian information. We first perform pre-training on the LUPerson-TA dataset, where each image is annotated with text and attributes, and then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search. Extensive experiments demonstrate that our framework facilitates the learning of general pedestrian representations and thus leads to promising results on various pedestrian analysis tasks.
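A natural way to train with these three objectives jointly is a weighted sum of the corresponding losses; the formulation below is only a sketch under that assumption, where the weights $\lambda_{1}$ and $\lambda_{2}$ and the exact loss definitions are illustrative rather than taken from the paper:
\[
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{ssl}} \;+\; \lambda_{1}\,\mathcal{L}_{\mathrm{itc}} \;+\; \lambda_{2}\,\mathcal{L}_{\mathrm{attr}},
\]
where $\mathcal{L}_{\mathrm{ssl}}$ denotes the self-supervised contrastive loss on augmented views of the same image, $\mathcal{L}_{\mathrm{itc}}$ the image-text contrastive loss aligning pedestrian images with their textual descriptions, and $\mathcal{L}_{\mathrm{attr}}$ the multi-attribute classification loss over the annotated attributes.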