Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification

From arXiv. This manuscript has been accepted for publication as a regular paper in the IEEE Transactions on Multimedia. We have uploaded the source pdfLaTeX files this time.

In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, both of which are critical for person re-identification. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules to address this issue. Specifically, MSTAT consists of three stages that encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving a holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to reduce the computational cost, Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct self-attention along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract informative and discriminative feature representations at different stages. All of them are realized with newly designed self-attention operations that carry specific meanings. Moreover, temporal patch shuffling is introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting informative and discriminative information from videos, and show that MSTAT achieves state-of-the-art accuracy on various standard benchmarks.
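The idea of running self-attention along the spatial and temporal dimensions separately, and of shuffling patches over time, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the identity query/key/value projections, the single attention head, and the per-position frame permutation used for temporal patch shuffling are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    # Single-head self-attention over the second-to-last axis of x: (..., L, D).
    # Assumption: identity Q/K/V projections, to keep the sketch short.
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x


def factorized_attention(clip):
    # clip: (T, N, D) = frames, patches per frame, channels.
    # Spatial step: the N patches inside each frame attend to each other.
    spatial = self_attention(clip)
    # Temporal step: the T frames at each patch position attend to each other.
    temporal = self_attention(np.swapaxes(spatial, 0, 1))  # (N, T, D)
    return np.swapaxes(temporal, 0, 1)  # back to (T, N, D)


def temporal_patch_shuffle(clip, rng):
    # Hypothetical reading of "temporal patch shuffling": for every patch
    # position, permute the order of frames independently.
    T, N, _ = clip.shape
    out = np.empty_like(clip)
    for n in range(N):
        out[:, n] = clip[rng.permutation(T), n]
    return out


def attention_pairs(T, N):
    # Number of query-key pairs: joint attention over all T*N tokens
    # versus spatial-then-temporal attention.
    joint = (T * N) ** 2
    separate = T * N * N + N * T * T
    return joint, separate
```

For a clip of 8 frames with 16 patches each, `attention_pairs(8, 16)` gives 16384 query-key pairs for joint attention against 3072 for the separated form, which is the kind of saving the abstract alludes to.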
