Text-based person search (TBPS) aims at retrieving a target person from an image gallery with a descriptive text query. Solving such a fine-grained cross-modal retrieval task is challenging, and the difficulty is further compounded by the lack of large-scale datasets. In this paper, we present a framework with two novel components to handle the problems brought by limited data. Firstly, to fully utilize the existing small-scale benchmarking datasets for more discriminative feature learning, we introduce a cross-modal momentum contrastive learning framework that enriches the training data available to a given mini-batch. Secondly, we propose to transfer knowledge learned from existing coarse-grained large-scale datasets containing image-text pairs from drastically different problem domains to compensate for the lack of TBPS training data. A transfer learning method is designed so that useful information can be transferred despite the large domain gap. Armed with these components, our method achieves a new state of the art on the CUHK-PEDES dataset, with significant improvements over the prior art in terms of Rank-1 accuracy and mAP. Our code is available at https://github.com/BrandonHanx/TextReID.
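The cross-modal momentum contrastive component can be pictured as a MoCo-style objective in which image queries are contrasted against text keys produced by a slowly updated momentum text encoder, with a feature queue supplying extra negatives beyond the current mini-batch. The PyTorch sketch below is illustrative only: the class name `CrossModalMoCo`, the hyper-parameters (feature dimension, queue size, momentum, temperature), and the single image-to-text direction are assumptions rather than the exact implementation released at the GitHub link above, which may, for example, apply the loss symmetrically in both directions or maintain separate queues per modality.

```python
# Illustrative sketch of a cross-modal momentum contrastive (MoCo-style) loss.
# All names and hyper-parameters here are assumptions, not the paper's exact API.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalMoCo(nn.Module):
    """Image queries vs. text keys from a momentum encoder, plus a negative queue."""

    def __init__(self, image_encoder, text_encoder, dim=256,
                 queue_size=4096, momentum=0.999, temperature=0.07):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Momentum copy of the text encoder: updated by EMA, never by gradients.
        self.text_encoder_m = copy.deepcopy(text_encoder)
        for p in self.text_encoder_m.parameters():
            p.requires_grad_(False)
        self.m = momentum
        self.t = temperature
        # Queue of past text features that serve as additional negatives.
        self.register_buffer("queue", F.normalize(torch.randn(dim, queue_size), dim=0))
        self.register_buffer("queue_ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        # Exponential moving average of the online text encoder's weights.
        for p, p_m in zip(self.text_encoder.parameters(),
                          self.text_encoder_m.parameters()):
            p_m.data.mul_(self.m).add_(p.data, alpha=1.0 - self.m)

    @torch.no_grad()
    def _dequeue_and_enqueue(self, keys):
        # Replace the oldest entries with the current batch of text keys.
        # For simplicity, assumes queue_size is divisible by the batch size.
        bsz = keys.shape[0]
        ptr = int(self.queue_ptr)
        self.queue[:, ptr:ptr + bsz] = keys.T
        self.queue_ptr[0] = (ptr + bsz) % self.queue.shape[1]

    def forward(self, images, texts):
        q = F.normalize(self.image_encoder(images), dim=1)      # image queries
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.text_encoder_m(texts), dim=1)  # text keys

        # Positive logits: matched image-text pairs; negatives come from the queue.
        l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)
        l_neg = torch.einsum("nc,ck->nk", q, self.queue.clone().detach())
        logits = torch.cat([l_pos, l_neg], dim=1) / self.t

        # The positive key sits at index 0 for every query.
        labels = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
        loss = F.cross_entropy(logits, labels)

        self._dequeue_and_enqueue(k)
        return loss
```

In this reading, the queue is what "enriches the training data for a given mini-batch": each image query is contrasted not only against the texts in its own batch but also against thousands of text features accumulated from earlier batches, which matters when the underlying dataset is small.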