Multi-person pose estimation and tracking serve as crucial steps for video understanding. Most state-of-the-art approaches rely on first estimating poses in each frame and only then performing data association and refinement. Despite the promising results achieved, such a strategy is inevitably prone to missed detections, especially in heavily cluttered scenes, since this tracking-by-detection paradigm is, by nature, largely dependent on visual evidence that is absent in the case of occlusion. In this paper, we propose a novel online approach to learning the pose dynamics, which are independent of pose detections in the current frame, and hence may serve as a robust estimate even in challenging scenarios including occlusion. Specifically, we derive this prediction of dynamics through a graph neural network~(GNN) that explicitly accounts for both spatial-temporal and visual information. It takes as input the historical pose tracklets and directly predicts the corresponding poses in the following frame for each tracklet. The predicted poses are then aggregated with the detected poses, if any, in the same frame so as to produce the final pose, potentially recovering the occluded joints missed by the estimator. Experiments on the PoseTrack 2017 and PoseTrack 2018 datasets demonstrate that the proposed method achieves results superior to the state of the art on both human pose estimation and tracking tasks.
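To make the pipeline concrete, below is a minimal PyTorch sketch of the two stages the abstract describes: a GNN that consumes a historical pose tracklet and predicts the joints' locations in the following frame, followed by an aggregation of the prediction with same-frame detections. The skeleton topology, layer sizes, residual prediction, and confidence-thresholded fusion rule are all illustrative assumptions, not the paper's actual architecture.

```python
# A minimal sketch of GNN-based pose dynamics prediction + fusion.
# Everything below (adjacency, layer sizes, fusion rule) is assumed,
# not taken from the paper.
import torch
import torch.nn as nn

NUM_JOINTS = 15  # PoseTrack-style skeleton (assumed)

# Skeleton edges (assumed topology); self-loops are added during normalization.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
         (1, 8), (8, 9), (9, 10), (8, 11), (11, 12), (12, 13), (0, 14)]

def normalized_adjacency(num_joints, edges):
    """Row-normalized adjacency with self-loops for message passing."""
    a = torch.eye(num_joints)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    return a / a.sum(dim=1, keepdim=True)

class PoseDynamicsGNN(nn.Module):
    """Predict next-frame joint coordinates from a T-frame tracklet.

    Input:  (B, T, J, 2) past joint coordinates of one tracklet.
    Output: (B, J, 2) predicted coordinates for frame T+1.
    """
    def __init__(self, history_len, hidden=128, num_layers=3):
        super().__init__()
        self.register_buffer("adj", normalized_adjacency(NUM_JOINTS, EDGES))
        # Flatten the temporal axis per joint into a single feature vector.
        self.embed = nn.Linear(history_len * 2, hidden)
        self.gnn_layers = nn.ModuleList(
            [nn.Linear(hidden, hidden) for _ in range(num_layers)])
        self.head = nn.Linear(hidden, 2)

    def forward(self, tracklet):
        b, t, j, _ = tracklet.shape
        h = self.embed(tracklet.permute(0, 2, 1, 3).reshape(b, j, t * 2))
        for layer in self.gnn_layers:
            # Graph convolution: mix neighboring joints, then transform.
            h = torch.relu(layer(self.adj @ h))
        # Predict a residual on the last observed pose for stability.
        return tracklet[:, -1] + self.head(h)

def fuse(predicted, detected, det_conf, thresh=0.3):
    """Keep detected joints when confident; fall back to the prediction.

    predicted, detected: (J, 2); det_conf: (J,) detector confidences.
    A crude stand-in for the paper's aggregation step.
    """
    keep = (det_conf > thresh).unsqueeze(-1)
    return torch.where(keep, detected, predicted)
```

Predicting a residual on the last observed pose, rather than absolute coordinates, is a common design choice for dynamics models since poses change little between adjacent frames; the fusion step then lets detections dominate where visual evidence exists and the prediction take over where joints are occluded.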