Attribute-based person search is the task of finding the person images that best match a set of text attributes given as a query. The main challenge of this task is the large modality gap between attributes and images. To reduce the gap, we present a new loss for learning cross-modal embeddings in the context of attribute-based person search. We regard a set of attributes as a category of people sharing the same traits. In a joint embedding space of the two modalities, our loss pulls images close to their person categories for modality alignment. More importantly, it pushes apart each pair of person categories by a margin determined adaptively by their semantic distance, where the distance metric is learned end-to-end so that the loss accounts for the importance of each attribute when relating person categories. Our loss, guided by the adaptive semantic margin, leads to more discriminative and semantically well-arranged distributions of person images. As a consequence, it enables a simple embedding model to achieve state-of-the-art results on public benchmarks without bells and whistles.
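To make the pull/push structure of the loss described in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' formulation: the weighted Hamming distance used as the semantic distance, the fixed `weights` (the abstract says the metric is learned end-to-end), the hinge form of the push term, and all function and variable names are assumptions chosen for illustration.

```python
import math


def weighted_attribute_distance(attrs_a, attrs_b, weights):
    """Semantic distance between two binary attribute vectors.

    Assumption: a weighted Hamming distance with fixed per-attribute weights.
    In the abstract the metric is learned end-to-end; here it is hard-coded.
    """
    return sum(w * abs(x - y) for x, y, w in zip(attrs_a, attrs_b, weights))


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def adaptive_margin_loss(image_embs, labels, category_embs, category_attrs, weights):
    """Hypothetical pull + push loss with a semantic, per-pair margin."""
    # Pull term: each image embedding is drawn toward its own category embedding.
    pull = sum(
        euclidean(x, category_embs[y]) ** 2 for x, y in zip(image_embs, labels)
    ) / len(image_embs)

    # Push term: a hinge penalty whenever two category embeddings are closer
    # than the margin given by their semantic (attribute) distance.
    push_terms = []
    num_categories = len(category_embs)
    for i in range(num_categories):
        for j in range(i + 1, num_categories):
            margin = weighted_attribute_distance(
                category_attrs[i], category_attrs[j], weights
            )
            gap = euclidean(category_embs[i], category_embs[j])
            push_terms.append(max(0.0, margin - gap))
    push = sum(push_terms) / len(push_terms) if push_terms else 0.0

    return pull + push


# Toy usage: three person categories over three binary attributes.
category_attrs = [[1, 0, 1], [1, 1, 0], [0, 1, 0]]
weights = [1.0, 0.5, 2.0]  # assumed attribute importances
category_embs = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
image_embs = [[0.0, 1.0], [1.0, 0.0]]
labels = [0, 1]
loss = adaptive_margin_loss(image_embs, labels, category_embs, category_attrs, weights)
```

In this sketch, categories that differ in heavily weighted attributes receive a larger margin and are therefore pushed further apart, which is the intuition behind a semantically arranged embedding space. A trainable version would replace the fixed `weights` with learned parameters and compute gradients in an autodiff framework.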