Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person across non-overlapping cameras. Previous methods usually focus on limited views, such as the spatial, temporal or spatial-temporal view, and thus lack observations from different feature domains. To capture richer perceptions and extract more comprehensive video representations, in this paper we propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID. More specifically, we design a trigeminal feature extractor that jointly transforms raw video data into the spatial, temporal and spatial-temporal domains. Besides, inspired by the great success of vision transformers, we introduce the transformer structure into video-based person Re-ID. In our work, three self-view transformers are proposed to exploit the relationships between local features for information enhancement in the spatial, temporal and spatial-temporal domains. Moreover, a cross-view transformer is proposed to aggregate the multi-view features into comprehensive video representations. The experimental results indicate that our approach achieves better performance than other state-of-the-art approaches on public Re-ID benchmarks. We will release the code for model reproduction.
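To make the three-view design concrete, the following is a minimal sketch of how a trigeminal pipeline with three self-view transformers and a cross-view transformer could be wired up. All module names, dimensions, the per-frame backbone, and the pooling used to form each view are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the Trigeminal Transformers (TMT) idea described above.
# Backbone, dimensions, and view construction are assumptions for illustration.
import torch
import torch.nn as nn


class TMTSketch(nn.Module):
    def __init__(self, feat_dim=256, n_heads=4, n_layers=2):
        super().__init__()
        # Hypothetical per-frame backbone producing an 8x4 feature map.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((8, 4)),
        )
        make_tf = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True),
            num_layers=n_layers,
        )
        # Three self-view transformers: spatial, temporal, spatial-temporal.
        self.spatial_tf = make_tf()
        self.temporal_tf = make_tf()
        self.st_tf = make_tf()
        # Cross-view transformer aggregating the three view embeddings.
        self.cross_tf = make_tf()

    def forward(self, clip):                       # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        fmap = self.backbone(clip.flatten(0, 1))   # (B*T, D, 8, 4)
        d, h, w = fmap.shape[1:]
        tokens = fmap.flatten(2).transpose(1, 2)   # (B*T, H'*W', D)

        # Spatial view: patch tokens averaged over time -> (B, H'*W', D)
        spatial = tokens.reshape(b, t, h * w, d).mean(dim=1)
        # Temporal view: frame tokens averaged over space -> (B, T, D)
        temporal = tokens.mean(dim=1).reshape(b, t, d)
        # Spatial-temporal view: all patch-frame tokens -> (B, T*H'*W', D)
        st = tokens.reshape(b, t * h * w, d)

        # Self-view transformers enhance local features within each view.
        views = [
            self.spatial_tf(spatial).mean(dim=1),
            self.temporal_tf(temporal).mean(dim=1),
            self.st_tf(st).mean(dim=1),
        ]
        # Cross-view transformer fuses the three view embeddings.
        fused = self.cross_tf(torch.stack(views, dim=1))   # (B, 3, D)
        return fused.mean(dim=1)                            # video embedding


if __name__ == "__main__":
    video = torch.randn(2, 8, 3, 256, 128)   # 2 tracklets, 8 frames each
    print(TMTSketch()(video).shape)           # torch.Size([2, 256])
```

In this sketch the three views are simple average-pooled projections of one shared feature map; the paper's actual extractor and fusion may differ, but the structure (per-view self-attention followed by cross-view aggregation) mirrors the description above.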