Text-based Person Search (TPS) aims to retrieve pedestrian images that match a given text description rather than a query image. Recent Vision-Language Pre-training (VLP) models bring transferable knowledge to the downstream TPS task, enabling more effective performance gains. However, existing VLP-based TPS methods only utilize the pre-trained visual encoder, neglecting the corresponding textual representation and thereby breaking the modality alignment learned from large-scale pre-training. In this paper, we explore how to fully exploit the textual potential of VLP for TPS. We first build a VLP-TPS baseline model, the first TPS model with both modalities pre-trained. We then propose Multi-Integrity Description Constraints (MIDC) to enhance the robustness of the textual modality by incorporating different components of the fine-grained corpus during training. Inspired by the prompt paradigm used for zero-shot classification with VLP models, we further propose the Dynamic Attribute Prompt (DAP), which provides a unified corpus of fine-grained attributes as language hints for the image modality. Extensive experiments show that our proposed TPS framework achieves state-of-the-art performance, exceeding the previous best method by a clear margin.