Capturing the interactions between humans and their environment in 3D is important for many applications in robotics, graphics, and vision. Recent methods that reconstruct the 3D human and object from a single RGB image do not produce consistent relative translation across frames because they assume a fixed depth. Moreover, their performance drops significantly when the object is occluded. In this work, we propose a novel method to track the 3D human, the object, the contacts between them, and their relative translation across frames from a single RGB camera, while being robust to heavy occlusions. Our method is built on two key insights. First, we condition our neural field reconstructions of the human and object on per-frame SMPL model estimates obtained by pre-fitting SMPL to the video sequence. This improves neural reconstruction accuracy and produces coherent relative translation across frames. Second, human and object motion from visible frames provides valuable information for inferring the occluded object. We propose a novel transformer-based neural network that explicitly exploits object visibility and human motion, leveraging neighbouring frames to make predictions for the occluded frames. Building on these insights, our method tracks both human and object robustly even under occlusions. Experiments on two datasets show that our method significantly improves over the state-of-the-art methods. Our code and pretrained models are available at: https://virtualhumans.mpi-inf.mpg.de/VisTracker
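A minimal, illustrative sketch of the second insight, assuming a drastically simplified setting: each frame carries a per-frame object pose estimate and a visibility flag, and occluded frames are infilled from visible neighbouring frames with weights that decay in temporal distance. The softmax-weighted average here is a toy stand-in for the learned transformer, and all names (`infill_occluded`, `temperature`) are assumptions for illustration, not the paper's API.

```python
import math

def infill_occluded(poses, visible, temperature=2.0):
    """Toy stand-in for a visibility-aware temporal model:
    for each occluded frame, predict the object pose as a
    softmax-weighted average of poses from visible frames,
    with weights decaying in temporal distance.

    poses:   list of floats (a 1-D pose parameter, for simplicity)
    visible: list of booleans, True where the object is visible
    """
    out = list(poses)
    vis_idx = [i for i, v in enumerate(visible) if v]
    for i, v in enumerate(visible):
        if v or not vis_idx:
            continue  # keep visible frames; nothing to borrow from if none visible
        # Attention-like logits: closer visible frames get higher weight.
        logits = [-abs(i - j) / temperature for j in vis_idx]
        m = max(logits)                       # subtract max for numerical stability
        w = [math.exp(l - m) for l in logits]
        s = sum(w)
        out[i] = sum(wi * poses[j] for wi, j in zip(w, vis_idx)) / s
    return out
```

For a frame equidistant from two visible neighbours, the infilled pose is simply their mean; the real method instead learns these weights jointly with human motion features.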