Multi-person pose tracking is an important element for many applications and requires to estimate the human poses of all persons in a video and to track them over time. The association of poses across frames remains an open research problem, in particular for online tracking methods, due to motion blur, crowded scenes and occlusions. To tackle the association challenge, we propose a Dual-Source Attention Transformer that incorporates three core aspects: i) In order to re-identify persons that have been occluded, we propose a pose-conditioned re-identification network that provides an initial embedding and allows to match persons even if the number of visible joints differs between the frames. ii) We incorporate edge embeddings based on temporal pose similarity and the impact of appearance and pose similarity is automatically adapted. iii) We propose an attention based matching layer for pose-to-track association and duplicate removal. We evaluate our approach on Market1501, PoseTrack 2018 and PoseTrack21.
翻译:暂无翻译