This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video. We propose an extremely lightweight yet highly effective approach that builds upon the latest advancements in human detection and video understanding. Our method operates in two stages: keypoint estimation in frames or short clips, followed by lightweight tracking to generate keypoint predictions linked over the entire video. For frame-level pose estimation we experiment with Mask R-CNN, as well as our own proposed 3D extension of this model, which leverages temporal information over small clips to generate more robust frame predictions. We conduct extensive ablative experiments on the newly released multi-person video pose estimation benchmark, PoseTrack, to validate various design choices of our model. Our approach achieves a Multi-Object Tracking Accuracy (MOTA) of 55.2% on the validation set and 51.8% on the test set, and achieves state-of-the-art performance on the ICCV 2017 PoseTrack keypoint tracking challenge.
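To make the two-stage structure concrete, below is a minimal sketch of the lightweight tracking stage, assuming per-frame detections (with bounding boxes) have already been produced by a Mask R-CNN-style estimator. The IoU-based cost, Hungarian matching via scipy, and the 0.3 IoU threshold are illustrative assumptions, not necessarily the paper's exact design choices.

```python
# A hedged sketch of stage 2 (lightweight tracking): link per-frame
# detections into tracks by solving a bipartite assignment on box IoU.
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def link_tracks(frames):
    """Assign a 'track_id' to every detection by frame-to-frame matching.

    `frames` is a list of per-frame detection lists; each detection is a
    dict with a 'box' (and, in the full system, its estimated keypoints).
    """
    next_id = 0
    prev = []
    for dets in frames:
        if prev and dets:
            # Cost matrix from box IoU; Hungarian algorithm finds the
            # minimum-cost one-to-one linking to the previous frame.
            cost = np.array([[1.0 - box_iou(p['box'], d['box'])
                              for d in dets] for p in prev])
            rows, cols = linear_sum_assignment(cost)
            for r, c in zip(rows, cols):
                if cost[r, c] < 0.7:  # assumed IoU threshold of 0.3
                    dets[c]['track_id'] = prev[r]['track_id']
        for d in dets:
            if 'track_id' not in d:  # unmatched detection starts a new track
                d['track_id'] = next_id
                next_id += 1
        prev = dets
    return frames
```

Keeping the tracker this simple is what makes the overall approach lightweight: all of the heavy computation lives in the per-frame (or per-clip) keypoint estimation stage, and linking reduces to a small assignment problem per frame pair.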