Fusion for Visual-Infrared Person ReID in Real-World Surveillance Using Corrupted Multimodal Data

Visible-infrared person re-identification (V-I ReID) seeks to match images of individuals captured over a distributed network of RGB and IR cameras. The task is challenging because of the significant differences between the V and I modalities, especially under real-world conditions, where images are corrupted by, e.g., blur, noise, and weather. Indeed, state-of-the-art V-I ReID models cannot leverage corrupted modality information to sustain a high level of accuracy. In this paper, we propose an efficient model for multimodal V-I ReID, named Multimodal Middle Stream Fusion (MMSF), that preserves modality-specific knowledge for improved robustness to corrupted multimodal images. In addition, three state-of-the-art attention-based multimodal fusion models are adapted to address corrupted multimodal data in V-I ReID, allowing them to dynamically balance the importance of each modality. Recently, evaluation protocols have been proposed to assess the robustness of ReID models under challenging real-world scenarios. However, these protocols are limited to unimodal V settings. For realistic evaluation of multimodal (and cross-modal) V-I person ReID models, we propose new challenging corrupted datasets for scenarios where V and I cameras are co-located (CL) and not co-located (NCL). Finally, the benefits of our Masking and Local Multimodal Data Augmentation (ML-MDA) strategy are explored to improve the robustness of ReID models to multimodal corruption. Our experiments on clean and corrupted versions of the SYSU-MM01, RegDB, and ThermalWORLD datasets indicate which multimodal V-I ReID models are more likely to perform well in real-world operational conditions. In particular, our ML-MDA is an important strategy for a V-I person ReID system to sustain high accuracy and robustness when processing corrupted multimodal images. Also, our multimodal ReID model MMSF outperforms every other method under both CL and NCL camera scenarios.
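The abstract does not spell out how ML-MDA works, so the following is only a loose illustration of the general idea of modality masking during training, not the authors' method. It assumes paired V and I images given as nested lists of pixel rows; the function name `mask_one_modality` and the parameter `p_mask` are hypothetical. With some probability, one randomly chosen modality is blanked out, which is one plausible way to push a fusion model not to over-rely on either input.

```python
import random


def mask_one_modality(visible, infrared, p_mask=0.5, rng=None):
    """Return a (visible, infrared) pair where at most one modality is zeroed.

    Hypothetical sketch: `visible` and `infrared` are images given as nested
    lists of pixel rows. With probability `p_mask`, one randomly chosen
    modality is replaced by zeros; the other is left untouched.
    """
    rng = rng or random.Random()
    if rng.random() >= p_mask:
        return visible, infrared  # no masking for this sample

    def zeros_like(img):
        # Build an all-zero image with the same shape as `img`.
        return [[0 for _ in row] for row in img]

    # Pick which modality to blank with equal probability.
    if rng.random() < 0.5:
        return zeros_like(visible), infrared
    return visible, zeros_like(infrared)


# Usage: a tiny 2x2 "visible" and "infrared" pair.
v = [[10, 20], [30, 40]]
ir = [[1, 2], [3, 4]]
v_aug, ir_aug = mask_one_modality(v, ir, p_mask=1.0, rng=random.Random(0))
```

With `p_mask=1.0` exactly one of the two returned images is all zeros; with `p_mask=0.0` the pair is returned unchanged. A real pipeline would more likely operate on tensors and combine this with local (patch-level) corruptions, which this sketch does not attempt.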
