In this paper, we propose a novel feature learning framework for video person re-identification (re-ID). The proposed framework aims to exploit the rich temporal information in video sequences and to tackle the poor spatial alignment of moving pedestrians. More specifically, to exploit temporal information, we design a temporal residual learning (TRL) module that simultaneously extracts the generic and specific features of consecutive frames. The TRL module is equipped with two bi-directional LSTMs (BiLSTMs), each responsible for describing a moving person from a different aspect, providing complementary information for better feature representations. To deal with the poor spatial alignment in video re-ID datasets, we propose a spatial-temporal transformer network (ST^2N) module. Transformation parameters in the ST^2N module are learned by leveraging the high-level semantic information of the current frame as well as the temporal context from other frames. With fewer learnable parameters, the proposed ST^2N module enables effective person alignment under significant appearance changes. Extensive experimental results on the large-scale MARS, PRID2011, iLIDS-VID and SDU-VID datasets demonstrate that the proposed method achieves consistently superior performance and outperforms most of the very recent state-of-the-art methods.
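To make the two modules concrete, the following is a minimal PyTorch-style sketch of the ideas summarized above. The module names, feature dimensions, the choice of the sequence mean as the "residual" reference and as the temporal context, and the affine parameterization of the transformer are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of a TRL-like temporal module and an ST^2N-like aligner.
# Dimensions, the residual reference, and the context vector are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TRLSketch(nn.Module):
    """Two BiLSTMs over per-frame CNN features: one models the raw sequence
    (generic stream), the other models frame-wise residuals w.r.t. the
    sequence mean (specific stream); their outputs are concatenated."""
    def __init__(self, feat_dim=2048, hidden=256):
        super().__init__()
        self.generic = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.specific = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                            # x: (B, T, feat_dim)
        g, _ = self.generic(x)                       # generic tracklet appearance
        res = x - x.mean(dim=1, keepdim=True)        # frame-specific residual (assumption)
        s, _ = self.specific(res)
        return torch.cat([g, s], dim=-1)             # (B, T, 4 * hidden)

class ST2NSketch(nn.Module):
    """Affine spatial transformer whose parameters are predicted from the
    current frame's feature together with a temporal context vector."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.loc = nn.Linear(2 * feat_dim, 6)        # few learnable parameters
        self.loc.weight.data.zero_()                 # initialize to the identity transform
        self.loc.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, frame_map, frame_feat, context_feat):
        # frame_map: (B, C, H, W) conv feature map to align;
        # frame_feat / context_feat: (B, feat_dim) pooled descriptors.
        theta = self.loc(torch.cat([frame_feat, context_feat], dim=-1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, frame_map.size(), align_corners=False)
        return F.grid_sample(frame_map, grid, align_corners=False)
```

In this reading, the context vector passed to the aligner could simply be the tracklet-averaged feature, so that each frame is warped consistently with its temporal neighbors; the actual source of temporal context in the paper may differ.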