Most existing domain generalization works for person re-identification (ReID) focus on handling style differences between domains while largely ignoring unpredictable camera view changes, which we identify as another major factor behind the poor generalization of ReID methods. To tackle viewpoint change, this work proposes to use a 3D dense pose estimation model and a texture mapping module to map pedestrian images to canonical view images. Due to the imperfection of the texture mapping module, the canonical view images may lose discriminative detail clues from the original images, so using them directly for ReID inevitably results in poor performance. To handle this issue, we propose to fuse the original image and the canonical view image via a transformer-based module. The key insight of this design is that the cross-attention mechanism in the transformer is well suited to aligning the discriminative texture clues of the original image with the canonical view image, compensating for the latter's low-quality texture information. Through extensive experiments, we show that our method achieves superior performance over existing approaches in various evaluation settings.
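To make the fusion idea concrete, below is a minimal PyTorch sketch of a cross-attention fusion module under one plausible reading of the abstract: canonical-view tokens act as queries that attend over original-image tokens, pulling discriminative texture from the original image into the canonical view. The module name `CrossViewFusion`, the token dimensions, and the query/key assignment are illustrative assumptions, not details specified by the paper.

```python
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    """Hypothetical sketch of transformer-based fusion for ReID.

    Canonical-view patch tokens query original-image patch tokens so that
    discriminative texture clues from the original image can compensate for
    the low-quality texture of the mapped canonical view.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, canon_tokens: torch.Tensor, orig_tokens: torch.Tensor):
        # canon_tokens: (B, N_c, dim) tokens of the canonical view image
        # orig_tokens:  (B, N_o, dim) tokens of the original image
        q = self.norm_q(canon_tokens)
        kv = self.norm_kv(orig_tokens)
        attended, _ = self.cross_attn(q, kv, kv)  # align texture clues
        x = canon_tokens + attended               # residual fusion
        x = x + self.ffn(self.norm_ffn(x))        # standard transformer FFN
        return x                                  # fused tokens for a ReID head

# Usage with dummy tokens (e.g., from a ViT backbone, an assumption here):
fusion = CrossViewFusion(dim=768, num_heads=8)
canon = torch.randn(2, 128, 768)  # canonical-view patch tokens
orig = torch.randn(2, 196, 768)   # original-image patch tokens
fused = fusion(canon, orig)       # shape: (2, 128, 768)
```

Keeping the canonical view on the query side reflects the abstract's stated goal: the canonical view image is the one missing detail, so each of its tokens retrieves matching texture from the original image rather than the other way around.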