In video person re-identification (Re-ID), the network must consistently extract features of the target person from successive frames. Existing methods tend to focus only on how to exploit temporal information, which often leads to networks being fooled by similar appearances and identical backgrounds. In this paper, we propose a Disentanglement and Switching and Aggregation Network (DSANet), which separates features representing identity from features based on camera characteristics, and pays more attention to ID information. We also introduce an auxiliary task that utilizes a new pair of features created through switching and aggregation to increase the network's capability across various camera scenarios. Furthermore, we devise a Target Localization Module (TLM) that extracts features robust to changes in the target's position over the frame sequence, and a Frame Weight Generation (FWG) that reflects temporal information in the final representation. Various loss functions for disentanglement learning are designed so that each component of the network can cooperate while satisfactorily performing its own role. Quantitative and qualitative results from extensive experiments demonstrate the superiority of DSANet over state-of-the-art methods on three benchmark datasets.
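The core mechanism described above, disentangling each feature into an identity part and a camera-characteristic part, then switching the camera parts between two samples and aggregating them into a new feature pair, can be illustrated with a minimal PyTorch-style sketch. The module names (Disentangler, switch_and_aggregate) and the additive recombination are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class Disentangler(nn.Module):
    """Hypothetical sketch: splits a frame-level feature into an
    identity-related part and a camera-characteristic part."""
    def __init__(self, dim: int):
        super().__init__()
        self.id_head = nn.Linear(dim, dim)   # identity-related features
        self.cam_head = nn.Linear(dim, dim)  # camera-characteristic features

    def forward(self, feat: torch.Tensor):
        return self.id_head(feat), self.cam_head(feat)

def switch_and_aggregate(id_a, cam_a, id_b, cam_b):
    """Create a new feature pair by swapping the camera features of two
    samples and recombining them with the original identity features.
    Additive aggregation here is an assumption, not the paper's choice."""
    new_a = id_a + cam_b  # identity of A under camera characteristics of B
    new_b = id_b + cam_a  # identity of B under camera characteristics of A
    return new_a, new_b

if __name__ == "__main__":
    torch.manual_seed(0)
    dis = Disentangler(dim=256)
    feat_a, feat_b = torch.randn(4, 256), torch.randn(4, 256)
    id_a, cam_a = dis(feat_a)
    id_b, cam_b = dis(feat_b)
    new_a, new_b = switch_and_aggregate(id_a, cam_a, id_b, cam_b)
    print(new_a.shape, new_b.shape)  # torch.Size([4, 256]) each
```

The switched pair serves as an auxiliary training signal: because the identity component is unchanged, the network can be supervised to predict the same ID for the recombined features, encouraging ID cues and camera cues to land in separate subspaces.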