Recently, with the advance of deep Convolutional Neural Networks (CNNs), person Re-Identification (Re-ID) has achieved great success in various applications. However, because CNNs have limited receptive fields, it remains challenging to extract discriminative global representations for persons captured by non-overlapping cameras. Meanwhile, Transformers demonstrate a strong ability to model long-range dependencies in spatial and sequential data. In this work, we take advantage of both CNNs and Transformers and propose a novel learning framework, named Hierarchical Aggregation Transformer (HAT), for high-performance image-based person Re-ID. To achieve this goal, we first propose a Deeply Supervised Aggregation (DSA) to recurrently aggregate hierarchical features from CNN backbones. With multi-granularity supervision, the DSA enhances multi-scale features for person retrieval, which distinguishes it from previous methods. Then, we introduce a Transformer-based Feature Calibration (TFC) to integrate low-level detail information as the global prior for high-level semantic information. The proposed TFC is inserted into each level of the hierarchical features, yielding substantial performance improvements. To the best of our knowledge, this work is the first to take advantage of both CNNs and Transformers for image-based person Re-ID. Comprehensive experiments on four large-scale Re-ID benchmarks demonstrate that our method outperforms several state-of-the-art methods. The code is released at https://github.com/AI-Zhpp/HAT.
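The idea of using low-level detail as a prior for high-level semantics can be illustrated with a toy cross-attention step. The sketch below is not the authors' TFC or DSA; the released code is the authoritative implementation. It assumes, purely for illustration, that both feature levels have already been flattened into token lists of equal channel width, that attention is single-head with no learned projections, and that the result is added back as a residual. The names `cross_attention`, `high_tokens`, and `low_tokens` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(high_tokens, low_tokens):
    """Toy single-head attention without learned projections (an assumption).

    Each high-level token acts as a query over all low-level tokens, which
    serve as both keys and values. The attended low-level detail is added
    back to the high-level token as a residual.
    """
    dim = len(high_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    calibrated = []
    for query in high_tokens:
        # Scaled dot-product similarity between the query and every key.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in low_tokens]
        weights = softmax(scores)
        # Weighted sum of low-level values, one channel at a time.
        attended = [sum(w * value[d] for w, value in zip(weights, low_tokens))
                    for d in range(dim)]
        calibrated.append([q + a for q, a in zip(query, attended)])
    return calibrated


# Toy inputs: 2 high-level tokens and 3 low-level tokens, 4 channels each.
high = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
low = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
out = cross_attention(high, low)
```

In a real model the queries, keys, and values would pass through learned linear projections and multiple heads, and the tokens would come from CNN feature maps at different stages. The toy version only shows how every high-level position can draw on all low-level positions at once, which a convolution with a limited receptive field cannot do in a single layer.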