Generalizable person re-identification (Re-ID) is an active research topic in machine learning and computer vision, and it plays a significant role in realistic scenarios due to its wide applications in public security and video surveillance. However, previous methods mainly focus on visual representation learning while neglecting to explore the potential of semantic features during training, which easily leads to poor generalization capability when the model is adapted to a new domain. In this paper, we propose a Multi-Modal Equivalent Transformer called MMET for more robust visual-semantic embedding learning on visual, textual, and visual-textual tasks, respectively. To further enhance robust feature learning in the context of the transformer, a dynamic masking mechanism called the Masked Multimodal Modeling (MMM) strategy is introduced to mask both image patches and text tokens; it works jointly on multimodal or unimodal data and significantly boosts the performance of generalizable person Re-ID. Extensive experiments on benchmark datasets demonstrate the competitive performance of our method over previous approaches. We hope this method will advance research on visual-semantic representation learning. Our source code is publicly available at https://github.com/JeremyXSC/MMET.
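To make the MMM idea more concrete, the following is a minimal PyTorch sketch of a joint masking step over image patches and text tokens. It is an illustrative assumption of how such a mechanism could be implemented, not the authors' actual code; the names `mmm_mask`, `mask_ratio`, `patch_mask_token`, and `text_mask_id` are all hypothetical.

```python
# Minimal sketch of a joint masking step in the spirit of the MMM strategy.
# All names and defaults here are illustrative assumptions.
import torch

def mmm_mask(patch_embeds, text_ids, mask_ratio=0.15,
             patch_mask_token=None, text_mask_id=103):
    """Randomly mask image patches and text tokens with the same ratio.

    patch_embeds: (B, N, D) patch embeddings from the visual encoder, or None.
    text_ids:     (B, L)    token ids of the textual description, or None.
    Passing None for either input lets the same routine handle unimodal data.
    """
    if patch_embeds is not None:
        B, N, D = patch_embeds.shape
        # Boolean map of patches to keep (True = keep, False = mask).
        keep = torch.rand(B, N, device=patch_embeds.device) > mask_ratio
        if patch_mask_token is None:
            patch_mask_token = torch.zeros(D, device=patch_embeds.device)
        patch_embeds = torch.where(keep.unsqueeze(-1), patch_embeds,
                                   patch_mask_token)
    if text_ids is not None:
        # Replace masked token ids with a dedicated [MASK] id.
        keep = torch.rand_like(text_ids, dtype=torch.float) > mask_ratio
        text_ids = torch.where(keep, text_ids,
                               torch.full_like(text_ids, text_mask_id))
    return patch_embeds, text_ids
```

In this sketch the masked inputs would then be fed to the transformer encoder, and the masking ratio could be made dynamic (e.g., varied over training) to match the dynamic behavior described above.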