Its numerous applications make multi-human 3D pose estimation a remarkably impactful area of research. Nevertheless, given a multi-view system composed of several regular RGB cameras, 3D multi-pose estimation presents several challenges. First, each person must be uniquely identified across the different views so that the 2D information provided by the cameras can be correctly associated with each individual. Second, the process of estimating each person's 3D pose from the multi-view 2D information must be robust to noise and potential occlusions in the scene. In this work, we address these two challenges with the help of deep learning. Specifically, we present a model based on Graph Neural Networks that predicts the cross-view correspondences of the people in the scene, along with a Multilayer Perceptron that takes the matched 2D points and yields the 3D pose of each person. Both models are trained in a self-supervised manner, thus avoiding the need for large datasets with 3D annotations.
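To make the second stage concrete, the following is a minimal sketch of the lifting step described above: a small MLP that maps the concatenated multi-view 2D joints of one person to a 3D pose. The joint count, number of views, hidden size, and random weights are all hypothetical placeholders (the paper's actual architecture and trained parameters are not specified here); the sketch only illustrates the input/output shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)

N_JOINTS = 17  # hypothetical joint count (e.g. a COCO-style skeleton)
N_VIEWS = 4    # hypothetical number of RGB cameras in the multi-view setup


def mlp_lift(x, w1, b1, w2, b2):
    """Minimal two-layer MLP: flattened multi-view 2D joints -> 3D joints."""
    h = np.maximum(0.0, x @ w1 + b1)  # ReLU hidden layer
    return h @ w2 + b2                # linear output of size N_JOINTS * 3


# Randomly initialized weights stand in for self-supervised trained parameters.
d_in, d_hid, d_out = N_VIEWS * N_JOINTS * 2, 256, N_JOINTS * 3
w1 = rng.normal(0.0, 0.01, (d_in, d_hid)); b1 = np.zeros(d_hid)
w2 = rng.normal(0.0, 0.01, (d_hid, d_out)); b2 = np.zeros(d_out)

# One person's 2D joints as seen from each view (after cross-view matching),
# flattened into a single input vector.
views_2d = rng.uniform(0.0, 1.0, (N_VIEWS, N_JOINTS, 2))
pose_3d = mlp_lift(views_2d.reshape(-1), w1, b1, w2, b2).reshape(N_JOINTS, 3)
print(pose_3d.shape)  # (17, 3): one 3D coordinate per joint
```

In practice the input to such a lifter would be the per-person 2D detections grouped by the correspondence model, and the weights would be learned with a self-supervised objective rather than sampled at random.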