Video-based person re-identification (re-ID) aims to match the same person across video clips. Efficiently exploiting multi-scale fine-grained features while modeling the structural interactions among them is pivotal to its success. In this paper, we propose a hybrid framework, Dense Interaction Learning (DenseIL), that combines the principal advantages of CNN-based and attention-based architectures to tackle the difficulties of video-based person re-ID. DenseIL contains a CNN Encoder and a Transformer Decoder. The CNN Encoder efficiently extracts discriminative spatial features, while the Transformer Decoder is designed to model the inherent spatial-temporal interactions across frames. Unlike the vanilla Transformer, our Transformer Decoder additionally attends densely to intermediate fine-grained CNN features, which naturally yields a multi-scale spatial-temporal feature representation for each video clip. Moreover, we introduce a Spatio-TEmporal Positional Embedding (STEP-Emb) into the Transformer Decoder to capture the positional relations among the spatial-temporal inputs. In our experiments, DenseIL consistently and significantly outperforms all state-of-the-art methods on multiple standard video-based re-ID datasets.
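To make the data flow described above concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each CNN stage has already been projected to a common channel width, replaces the CNN Encoder with random stand-in features, and uses a single attention head without learned projections, feed-forward layers, or normalization. The names `step_embedding`, `cross_attention`, `dense_decoder`, and `make_tokens`, as well as the frame count, feature-map sizes, and the additive form of the positional embedding, are hypothetical choices made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def step_embedding(T, H, W, dim, rng):
    # Hypothetical spatio-temporal positional embedding: one vector per frame
    # plus one vector per spatial location, combined by broadcasting.
    temporal = rng.normal(scale=0.02, size=(T, 1, dim))
    spatial = rng.normal(scale=0.02, size=(1, H * W, dim))
    return (temporal + spatial).reshape(T * H * W, dim)


def cross_attention(queries, memory):
    # Single-head scaled dot-product attention (no learned projections).
    d = queries.shape[-1]
    scores = queries @ memory.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ memory


def dense_decoder(final_tokens, intermediate_tokens):
    # Each step attends to the tokens of one intermediate CNN stage and adds
    # the result residually, so the output mixes several scales.
    x = final_tokens
    for memory in intermediate_tokens:
        x = x + cross_attention(x, memory)
    return x.mean(axis=0)  # average-pool tokens into one clip-level feature


def make_tokens(T, H, W, dim, rng):
    # Stand-in for CNN features of T frames at resolution H x W (assumption).
    features = rng.normal(size=(T * H * W, dim))
    return features + step_embedding(T, H, W, dim, rng)


rng = np.random.default_rng(0)
T, DIM = 4, 32
coarse = make_tokens(T, 4, 2, DIM, rng)           # final-stage tokens
fine = [make_tokens(T, 16, 8, DIM, rng),          # intermediate stages
        make_tokens(T, 8, 4, DIM, rng)]
clip_feature = dense_decoder(coarse, fine)
print(clip_feature.shape)
```

In this sketch the coarse final-stage tokens act as queries and the finer intermediate stages act as memory, which is one plausible reading of "densely attends to intermediate fine-grained CNN features"; the actual layer-to-stage assignment in DenseIL may differ.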