Text-based person search is a challenging task that aims to retrieve pedestrian images with the same identity from an image gallery, given a query text description. In recent years, text-based person search has made substantial progress, and state-of-the-art methods achieve superior performance by learning local fine-grained correspondences between images and texts. However, existing methods explicitly extract image parts and text phrases via hand-crafted splits or external tools and then conduct complex cross-modal local matching. Moreover, existing methods seldom consider the information inequality between modalities caused by image-specific information. In this paper, we propose an efficient joint Information and Semantic Alignment Network (ISANet) for text-based person search. Specifically, we first design an image-specific information suppression module, which suppresses image background and environmental factors through relation-guided localization and channel attention filtration, respectively. This design effectively alleviates the information-inequality problem and realizes information alignment between images and texts. Second, we propose an implicit local alignment module that adaptively aggregates image and text features to a set of modality-shared semantic topic centers, and implicitly learns the local fine-grained correspondence between images and texts without additional supervision or complex cross-modal interactions. In addition, a global alignment is introduced as a supplement to the local perspective. Extensive experiments on multiple databases demonstrate the effectiveness and superiority of the proposed ISANet.
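The implicit local alignment idea above can be sketched as follows: tokens from each modality (image region features or word features) are softly assigned to a shared set of topic centers, pooled per center, and then compared center-by-center across modalities. This is only a minimal illustrative sketch, not the paper's implementation; the function names, feature dimensions, temperature value, and cosine-similarity scoring are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_to_centers(tokens, centers, tau=0.1):
    """Softly assign tokens to modality-shared topic centers and pool per center.

    tokens:  (n, d) features from one modality (image regions or words)
    centers: (k, d) shared topic centers (learned in the paper; randomly
             initialized here purely for illustration)
    returns: (k, d) one L2-normalized aggregated local feature per center
    """
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    assign = softmax(t @ c.T / tau, axis=1)   # (n, k) soft assignment weights
    pooled = assign.T @ tokens                # (k, d) center-wise pooling
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

def local_alignment_score(img_tokens, txt_tokens, centers):
    """Average center-wise cosine similarity between the two modalities."""
    img_local = aggregate_to_centers(img_tokens, centers)
    txt_local = aggregate_to_centers(txt_tokens, centers)
    return float((img_local * txt_local).sum(axis=1).mean())

rng = np.random.default_rng(0)
centers = rng.normal(size=(8, 64))   # k = 8 hypothetical shared topic centers
img = rng.normal(size=(24, 64))      # e.g. 24 image region features
txt = rng.normal(size=(16, 64))      # e.g. 16 word features
score = local_alignment_score(img, txt, centers)
```

Because both modalities are pooled against the same centers, no explicit cross-modal token matching is needed: each center induces a comparable local feature on each side, which is what makes the alignment "implicit".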