Predicting human motion is critical for assistive robots and AR/VR applications, where interaction with humans must be safe and comfortable. Accurate prediction, in turn, depends on understanding both the scene context and human intent. Although many works study scene-aware human motion prediction, intent is largely underexplored, owing to the lack of ego-centric views that disclose human intent and to the limited diversity in motions and scenes. To reduce this gap, we propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, and ego-centric views with eye gaze, which serves as a surrogate for inferring human intent. By employing inertial sensors for motion capture, our data collection is not tied to specific scenes, which further increases the diversity of motion dynamics observed in our subjects. We perform an extensive study of the benefits of leveraging eye gaze for ego-centric human motion prediction with various state-of-the-art architectures. Moreover, to realize the full potential of gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches. Our network achieves top performance in human motion prediction on the proposed dataset, thanks to the intent information from the gaze and the denoised gaze feature modulated by the motion. The proposed dataset and our network implementation will be publicly available.