Estimating 3D poses of multiple humans in real-time is a classic but still challenging task in computer vision. Its major difficulty lies in the ambiguity of cross-view association of 2D poses and the huge state space when there are multiple people in multiple views. In this paper, we present a novel solution for multi-human 3D pose estimation from multiple calibrated camera views. It takes 2D poses in different camera coordinates as inputs and aims to recover accurate 3D poses in the global coordinate system. Unlike previous methods that associate 2D poses among all pairs of views from scratch at every frame, we exploit the temporal consistency in videos to match the 2D inputs with 3D poses directly in 3D space. More specifically, we propose to retain the 3D pose for each person and update it iteratively via cross-view multi-human tracking. This novel formulation improves both accuracy and efficiency, as we demonstrate on widely-used public datasets. To further verify the scalability of our method, we propose a new large-scale multi-human dataset with 12 to 28 camera views. Without bells and whistles, our solution achieves 154 FPS on 12 cameras and 34 FPS on 28 cameras, indicating its ability to handle large-scale real-world applications. The proposed dataset is released at https://github.com/longcw/crossview_3d_pose_tracking.
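The abstract gives no implementation details, so the following is only a minimal, hypothetical sketch of the per-frame loop it describes: match each view's 2D detections against the retained 3D poses (here via reprojection-error affinities with a greedy assignment), then refresh each 3D pose from its matched 2D observations (here via standard DLT triangulation). All names and parameters (`project`, `affinity`, `update_tracks`, `triangulate`, `sigma`, `thresh`) are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch of matching 2D inputs to retained 3D poses in 3D space,
# then updating the 3D poses; not the authors' actual implementation.
import numpy as np

def project(P, X):
    """Project a 3D joint X (3,) into a view with 3x4 projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def affinity(pose_3d, pose_2d, P, sigma=50.0):
    """Affinity between a retained 3D pose (J, 3) and a 2D detection (J, 2)
    in one view: mean reprojection error in pixels, mapped to (0, 1]."""
    d = np.mean([np.linalg.norm(project(P, X) - x) for X, x in zip(pose_3d, pose_2d)])
    return float(np.exp(-d / sigma))

def update_tracks(tracks, detections, proj_mats, thresh=0.3):
    """Match 2D detections to retained 3D poses directly, view by view
    (no pairwise cross-view association), via greedy best-first assignment.
    tracks: list of (J, 3) arrays; detections: {view_id: list of (J, 2)}.
    Returns, per track, the matched 2D poses keyed by view id."""
    matches = {ti: {} for ti in range(len(tracks))}
    for v, poses in detections.items():
        pairs = sorted(((affinity(t, d, proj_mats[v]), ti, di)
                        for ti, t in enumerate(tracks)
                        for di, d in enumerate(poses)), reverse=True)
        used_t, used_d = set(), set()
        for s, ti, di in pairs:
            if s < thresh or ti in used_t or di in used_d:
                continue
            matches[ti][v] = poses[di]
            used_t.add(ti)
            used_d.add(di)
    return matches

def triangulate(observations):
    """Standard DLT triangulation of one joint from [(P, x_2d), ...] pairs."""
    A = []
    for P, x in observations:
        A.append(x[0] * P[2] - P[0])
        A.append(x[1] * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]

if __name__ == "__main__":
    # Toy example: two cameras, one person with a 2-joint "skeleton".
    K = np.array([[1000.0, 0, 320], [0, 1000.0, 240], [0, 0, 1]])
    cam = lambda tx: K @ np.hstack([np.eye(3), [[tx], [0.0], [5.0]]])
    proj_mats = {0: cam(0.0), 1: cam(-1.0)}
    gt = np.array([[0.0, 0.0, 0.0], [0.3, 0.5, 0.1]])
    detections = {v: [np.array([project(P, X) for X in gt])]
                  for v, P in proj_mats.items()}
    tracks = [gt + 0.05]  # retained 3D pose from the previous frame (slightly stale)
    m = update_tracks(tracks, detections, proj_mats)
    refined = np.array([triangulate([(proj_mats[v], p[j]) for v, p in m[0].items()])
                        for j in range(gt.shape[0])])
    print(np.abs(refined - gt).max())  # ~0: the pose is re-anchored to the 2D inputs
```

Note the complexity implication: each incoming 2D pose is scored against a handful of retained 3D tracks, rather than against detections in every other view, which is plausibly where the abstract's reported speedups come from; a production system would likely replace the greedy assignment with an optimal one (e.g., Hungarian) and weight joints by detection confidence.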