Video annotation is expensive and time-consuming. Consequently, datasets for multi-person pose estimation and tracking are less diverse and have sparser annotations than large-scale image datasets for human pose estimation. This makes it challenging to learn deep-learning-based models for associating keypoints across frames that are robust to nuisance factors such as motion blur and occlusions for the task of multi-person pose tracking. To address this issue, we propose an approach that relies on keypoint correspondences for associating persons in videos. Instead of training the network for estimating keypoint correspondences on video data, it is trained on large-scale image datasets for human pose estimation using self-supervision. Combined with a top-down framework for human pose estimation, we use keypoint correspondences to (i) recover missed pose detections and (ii) associate pose detections across video frames. Our approach achieves state-of-the-art results for multi-frame pose estimation and multi-person pose tracking on the PoseTrack 2017 and PoseTrack 2018 datasets.
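The association step described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual implementation: it assumes that each previous-frame pose has already been propagated into the current frame via the learned keypoint correspondences, scores candidate pairs with a PCK-style hit fraction, and matches greedily. All function names and thresholds are illustrative.

```python
# Hypothetical sketch of correspondence-based pose association across
# frames. Assumes keypoints from the previous frame were already
# propagated into the current frame via keypoint correspondences.
from typing import List, Tuple

Keypoints = List[Tuple[float, float]]  # one (x, y) per joint

def pose_similarity(propagated: Keypoints, detected: Keypoints,
                    thresh: float = 10.0) -> float:
    """Fraction of joints whose propagated location lands within
    `thresh` pixels of the detected joint (a PCK-style score)."""
    hits = sum(
        1 for (px, py), (dx, dy) in zip(propagated, detected)
        if (px - dx) ** 2 + (py - dy) ** 2 <= thresh ** 2
    )
    return hits / len(detected)

def associate(prev_poses: List[Keypoints], curr_poses: List[Keypoints],
              min_sim: float = 0.5) -> List[Tuple[int, int]]:
    """Greedy one-to-one assignment of previous-frame poses to
    current-frame detections, by descending similarity."""
    scores = sorted(
        ((pose_similarity(p, c), i, j)
         for i, p in enumerate(prev_poses)
         for j, c in enumerate(curr_poses)),
        reverse=True,
    )
    used_i, used_j, matches = set(), set(), []
    for s, i, j in scores:
        if s >= min_sim and i not in used_i and j not in used_j:
            used_i.add(i)
            used_j.add(j)
            matches.append((i, j))
    return matches
```

In practice, greedy matching could be replaced by an optimal bipartite assignment (e.g., the Hungarian algorithm), and poses scoring below `min_sim` against every detection would either start a new track or indicate a missed detection to recover.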