Text-based person retrieval aims to find the query person based on a textual description. The key is to learn a mapping into a common latent space shared by the visual and textual modalities. To achieve this goal, existing works employ segmentation to obtain explicit cross-modal alignments or utilize attention to explore salient alignments. These methods have two shortcomings: 1) Labeling cross-modal alignments is time-consuming. 2) Attention methods can explore salient cross-modal alignments but may ignore some subtle and valuable pairs. To alleviate these issues, we introduce an Implicit Visual-Textual (IVT) framework for text-based person retrieval. Different from previous models, IVT utilizes a single network to learn representations for both modalities, which contributes to visual-textual interaction. To explore fine-grained alignment, we further propose two implicit semantic alignment paradigms: multi-level alignment (MLA) and bidirectional mask modeling (BMM). The MLA module explores finer matching at the sentence, phrase, and word levels, while the BMM module aims to mine \textbf{more} semantic alignments between the visual and textual modalities. Extensive experiments are carried out to evaluate the proposed IVT on public datasets, i.e., CUHK-PEDES, RSTPReID, and ICFG-PEDES. Even without explicit body-part alignment, our approach still achieves state-of-the-art performance. Code is available at: https://github.com/TencentYoutuResearch/PersonRetrieval-IVT.
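The bidirectional mask modeling (BMM) idea can be illustrated with a minimal sketch of the masking step: tokens are randomly masked in each modality, and the model would then be trained to reconstruct them from cues in the other modality, pushing it to mine alignments beyond the most salient pairs. The function and token names below are hypothetical illustrations, not the authors' implementation.

```python
import random

def bidirectional_mask(visual_tokens, text_tokens, mask_ratio=0.3, seed=0):
    """Hypothetical sketch of the BMM masking step.

    Masks a fraction of tokens independently in each modality; a model
    trained on this objective must recover masked visual tokens from the
    text and masked text tokens from the image, encouraging it to mine
    subtle cross-modal alignments rather than only salient ones.
    """
    rng = random.Random(seed)

    def mask(tokens):
        # Mask at least one token per sequence.
        n = max(1, int(len(tokens) * mask_ratio))
        idx = set(rng.sample(range(len(tokens)), n))
        masked = ["[MASK]" if i in idx else t for i, t in enumerate(tokens)]
        return masked, sorted(idx)

    masked_v, v_idx = mask(visual_tokens)   # masked image patch tokens
    masked_t, t_idx = mask(text_tokens)     # masked description words
    return masked_v, v_idx, masked_t, t_idx

# Example: 10 patch tokens and a short description.
patches = [f"patch{i}" for i in range(10)]
words = "a man in a red jacket".split()
mv, vi, mt, ti = bidirectional_mask(patches, words, mask_ratio=0.3)
```

In practice the masked positions would be filled by learnable mask embeddings rather than a `[MASK]` string, and the reconstruction loss would be computed only at the masked indices returned here.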