We present a novel and effective method for calibrating cross-modal features in text-based person search. Our method is cost-effective and can easily retrieve specific persons from textual captions. Specifically, its architecture consists of only a dual-encoder and a detachable cross-modal decoder. Without extra multi-level branches or complex interaction modules serving as a neck after the backbone, our model performs high-speed inference based solely on the dual-encoder. In addition, our method introduces two novel losses to provide fine-grained cross-modal features. A Sew loss takes the quality of textual captions as guidance and aligns features between the image and text modalities. A Masking Caption Modeling (MCM) loss uses a masked-caption prediction task to establish detailed and generic relationships between textual and visual parts. We report top results on three popular benchmarks: CUHK-PEDES, ICFG-PEDES, and RSTPReID. In particular, our method achieves 73.81%, 74.25%, and 57.35% Rank@1 accuracy on them, respectively. We also validate each component of our method with extensive experiments. We hope our powerful and scalable paradigm will serve as a solid baseline and ease future research in text-based person search.
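To make the architecture described above concrete, the following is a minimal, illustrative PyTorch sketch (not the authors' released code) of a dual-encoder retrieval path with a detachable cross-modal decoder used only for a masked-caption prediction objective. All module choices, sizes, and names (e.g. `DualEncoder`, `mcm_logits`) are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch: dual-encoder retrieval with a detachable cross-modal decoder.
# The decoder is used only for the masked-caption (MCM-style) training objective
# and can be dropped at inference, which then relies on the dual-encoder alone.
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, embed_dim=256, vocab_size=30522):
        super().__init__()
        # Stand-ins for the real image/text backbones (e.g. a ViT / BERT in practice).
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))
        self.text_embed = nn.Embedding(vocab_size, embed_dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Detachable cross-modal decoder: only needed during training.
        self.cross_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.mlm_head = nn.Linear(embed_dim, vocab_size)

    def encode(self, images, token_ids):
        # Independent (dual) encoding of each modality for retrieval.
        img_feat = nn.functional.normalize(self.image_encoder(images), dim=-1)
        txt_tokens = self.text_encoder(self.text_embed(token_ids))
        txt_feat = nn.functional.normalize(txt_tokens.mean(dim=1), dim=-1)
        return img_feat, txt_feat

    def mcm_logits(self, masked_token_ids, images):
        # Predict masked caption tokens conditioned on image features
        # through the cross-modal decoder (training-time only).
        memory = self.image_encoder(images).unsqueeze(1)          # (B, 1, D)
        txt_tokens = self.text_encoder(self.text_embed(masked_token_ids))
        fused = self.cross_decoder(txt_tokens, memory)            # (B, L, D)
        return self.mlm_head(fused)                               # (B, L, vocab)

# Inference uses only the dual-encoder: rank images by cosine similarity to the caption.
model = DualEncoder()
images = torch.randn(8, 3 * 224 * 224)        # toy flattened images
captions = torch.randint(0, 30522, (8, 32))   # toy caption token ids
img_feat, txt_feat = model.encode(images, captions)
scores = txt_feat @ img_feat.t()              # text-to-image retrieval scores
```

Because the cross-modal decoder is touched only by `mcm_logits`, discarding it after training leaves retrieval cost identical to a plain dual-encoder, which is the speed advantage the abstract claims.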