Text-to-image person re-identification (ReID) aims to retrieve images of a person of interest using textual descriptions. However, due to the significant modality gap and the large intra-class variance in textual descriptions, text-to-image ReID remains a challenging problem. Accordingly, in this paper we propose a Semantically Self-Aligned Network (SSAN) to address these problems. First, we propose a novel method that automatically extracts semantically aligned part-level features from the two modalities. Second, we design a multi-view non-local network that captures the relationships between body parts, thereby establishing better correspondences between body parts and noun phrases. Third, we introduce a Compound Ranking (CR) loss that leverages the textual descriptions of other images of the same identity for extra supervision, thereby effectively reducing the intra-class variance in textual features. Finally, to expedite future research in text-to-image ReID, we build a new database named ICFG-PEDES. Extensive experiments demonstrate that SSAN outperforms state-of-the-art approaches by significant margins. Both the ICFG-PEDES database and the SSAN code are available at https://github.com/zifyloo/SSAN.
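For concreteness, the sketch below illustrates the idea behind the CR loss under simple assumptions: a margin-based ranking loss over cosine similarities with hardest-negative mining, combining a strong term on each image's own description with a weaker term on descriptions attached to other images of the same identity. The function names, margin values, and mining strategy here are illustrative assumptions, not SSAN's exact formulation.

```python
import torch
import torch.nn.functional as F

def ranking_loss(img_feat, txt_feat, labels, margin):
    """Margin-based ranking loss over cosine similarities (illustrative).

    img_feat, txt_feat: (B, D) L2-normalised embeddings.
    labels: (B,) identity labels; pairs with equal labels are positives.
    """
    sim = img_feat @ txt_feat.t()                       # (B, B) similarity matrix
    pos_mask = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos_sim = sim.diagonal()                            # matched image-text pairs
    # Hardest negative per image: most similar text of a different identity.
    neg_sim = sim.masked_fill(pos_mask, float('-inf')).max(dim=1).values
    return F.relu(margin - pos_sim + neg_sim).mean()

def compound_ranking_loss(img_feat, txt_feat, other_txt_feat, labels,
                          strong_margin=0.3, weak_margin=0.1):
    """Hypothetical CR-style loss: a strong ranking term on each image's
    own description plus a weaker term on descriptions of *other* images
    sharing the same identity, which supplies the extra supervision that
    reduces intra-class variance in the textual features."""
    strong = ranking_loss(img_feat, txt_feat, labels, strong_margin)
    weak = ranking_loss(img_feat, other_txt_feat, labels, weak_margin)
    return strong + weak
```

The smaller margin on the weak term reflects that descriptions of other same-identity images are noisier supervision than an image's own caption; treating them identically to matched pairs would over-constrain the text embedding space.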