Text-based person search aims to retrieve images of a specific pedestrian given a textual description. The key challenge of this task is to eliminate the inter-modality gap and achieve feature alignment across modalities. In this paper, we propose a semantic-aligned embedding method for text-based person search, in which feature alignment across modalities is achieved by automatically learning semantic-aligned visual and textual features. First, we introduce two Transformer-based backbones to encode robust feature representations of images and texts. Second, we design a semantic-aligned feature aggregation network that adaptively selects and aggregates features with the same semantics into part-aware features; this is achieved by a multi-head attention module constrained by a cross-modality part alignment loss and a diversity loss. Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performance.
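To make the aggregation step concrete, the following is a minimal sketch of one possible realization of the ideas described above: learnable part queries attend over backbone token features via multi-head attention to form part-aware features, a diversity term discourages the parts from collapsing onto the same tokens, and a contrastive-style cross-modality part alignment term pulls matched visual and textual parts together. The module and loss names, the number of parts, the cosine-similarity diversity penalty, and the InfoNCE-style alignment form are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartAggregation(nn.Module):
    """Hypothetical sketch: K learnable part queries attend over token
    features from a Transformer backbone to produce K part-aware features."""
    def __init__(self, dim=512, num_parts=6, num_heads=8):
        super().__init__()
        self.part_queries = nn.Parameter(torch.randn(num_parts, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                       # tokens: (B, N, dim)
        q = self.part_queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        parts, _ = self.attn(q, tokens, tokens)      # (B, K, dim)
        return parts

def diversity_loss(parts):
    """Assumed form: penalize pairwise cosine similarity between parts so
    that each query covers a distinct semantic region."""
    p = F.normalize(parts, dim=-1)                   # (B, K, dim)
    sim = torch.bmm(p, p.transpose(1, 2))            # (B, K, K)
    eye = torch.eye(p.size(1), device=p.device).unsqueeze(0)
    return ((sim - eye) ** 2).mean()

def part_alignment_loss(img_parts, txt_parts, tau=0.07):
    """Assumed cross-modality part alignment: contrastive matching of the
    k-th visual part with the k-th textual part within a batch."""
    v = F.normalize(img_parts, dim=-1)               # (B, K, dim)
    t = F.normalize(txt_parts, dim=-1)
    labels = torch.arange(v.size(0), device=v.device)
    loss = 0.0
    for k in range(v.size(1)):
        logits = v[:, k] @ t[:, k].T / tau           # (B, B) similarity logits
        loss = loss + F.cross_entropy(logits, labels)
    return loss / v.size(1)
```

In this sketch the same `PartAggregation` head (with shared or modality-specific queries) would be applied to both the visual and textual token sequences, and the two losses would be added to the usual retrieval objective.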