Predicting human motion is critical for assistive robots and AR/VR applications, where interactions with humans need to be safe and comfortable. Accurate prediction, in turn, depends on understanding both the scene context and human intentions. Although many works study scene-aware human motion prediction, intention-aware prediction remains largely underexplored due to the lack of ego-centric views that disclose human intent and the limited diversity of motions and scenes. To reduce this gap, we propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, and ego-centric views with eye gaze, which serves as a surrogate for inferring human intent. Because we employ inertial sensors for motion capture, our data collection is not tied to specific scenes, which further increases the diversity of motion dynamics observed from our subjects. We perform an extensive study of the benefits of leveraging eye gaze for ego-centric human motion prediction with various state-of-the-art architectures. Moreover, to realize the full potential of gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches. Our network achieves top performance in human motion prediction on the proposed dataset, thanks to the intent information from eye gaze and the denoised gaze features modulated by the motion. Code and data can be found at https://github.com/y-zheng18/GIMO.
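To make the idea of bidirectional communication between the gaze and motion branches concrete, below is a minimal sketch, not the authors' implementation: it assumes both branches produce same-dimensional feature sequences and exchanges information with two cross-attention blocks, one letting motion features attend to gaze (injecting intent cues) and one letting gaze features attend to motion (yielding motion-modulated, denoised gaze features). All module and variable names are illustrative assumptions.

```python
# Minimal sketch of bidirectional gaze<->motion feature exchange via cross-attention.
# Assumption: not the GIMO implementation; names and dimensions are hypothetical.
import torch
import torch.nn as nn


class BidirectionalGazeMotionFusion(nn.Module):
    """Exchanges information between gaze and motion feature sequences."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Motion queries attend to gaze keys/values, and vice versa.
        self.motion_from_gaze = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gaze_from_motion = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_motion = nn.LayerNorm(dim)
        self.norm_gaze = nn.LayerNorm(dim)

    def forward(self, motion_feat: torch.Tensor, gaze_feat: torch.Tensor):
        # motion_feat: (B, T_motion, dim), gaze_feat: (B, T_gaze, dim)
        m_from_g, _ = self.motion_from_gaze(motion_feat, gaze_feat, gaze_feat)
        g_from_m, _ = self.gaze_from_motion(gaze_feat, motion_feat, motion_feat)
        motion_out = self.norm_motion(motion_feat + m_from_g)  # gaze-informed motion features
        gaze_out = self.norm_gaze(gaze_feat + g_from_m)        # motion-modulated gaze features
        return motion_out, gaze_out


if __name__ == "__main__":
    fusion = BidirectionalGazeMotionFusion()
    motion = torch.randn(2, 30, 256)  # e.g. features for 30 past pose frames
    gaze = torch.randn(2, 30, 256)    # e.g. features for 30 gaze-direction frames
    m, g = fusion(motion, gaze)
    print(m.shape, g.shape)           # both torch.Size([2, 30, 256])
```

The fused motion features would then feed a downstream predictor of future poses; the symmetric design is what allows gaze to inform motion while motion simultaneously cleans up the noisy gaze signal.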