The ability to forecast human-environment collisions from egocentric observations is vital for enabling collision avoidance in applications such as VR, AR, and wearable assistive robotics. In this work, we introduce the challenging problem of predicting collisions in diverse environments from multi-view egocentric videos captured by body-mounted cameras. Solving this problem requires a generalizable perception system that can classify which human body joints will collide and estimate a collision region heatmap to localize collisions in the environment. To achieve this, we propose a transformer-based model called COPILOT that performs collision prediction and localization simultaneously, aggregating information across multi-view inputs through a novel 4D space-time-viewpoint attention mechanism. To train our model and enable future research on this task, we develop a synthetic data generation framework that produces egocentric videos of virtual humans moving and colliding within diverse 3D environments. This framework is then used to establish a large-scale dataset consisting of 8.6M egocentric RGBD frames. Extensive experiments show that COPILOT generalizes to unseen synthetic as well as real-world scenes. We further demonstrate that COPILOT's outputs are useful for downstream collision avoidance through simple closed-loop control. Please visit our project webpage at https://sites.google.com/stanford.edu/copilot.
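To make the 4D space-time-viewpoint attention idea concrete, below is a minimal sketch, not the released COPILOT implementation, of one way such a mechanism could be factorized: self-attention applied in turn over the spatial patch axis, the temporal frame axis, and the camera-view axis. The class name, tensor layout `(B, V, T, P, D)`, and hyperparameters are illustrative assumptions; per-joint collision logits and the collision-region heatmap would be predicted by separate heads on top of these tokens.

```python
# Sketch of factorized space-time-viewpoint attention (assumed layout, not the authors' code).
# Tokens have shape (B, V, T, P, D): batch, camera views, frames, patches per frame, embed dim.
import torch
import torch.nn as nn


class SpaceTimeViewAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # One self-attention layer per axis: patches (space), frames (time), cameras (viewpoint).
        self.space = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.view = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    @staticmethod
    def _attend(attn: nn.MultiheadAttention, x: torch.Tensor) -> torch.Tensor:
        # Self-attention over the middle (token) axis, with a residual connection.
        y, _ = attn(x, x, x, need_weights=False)
        return x + y

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        B, V, T, P, D = tokens.shape
        # Attend over spatial patches within each frame of each view.
        x = self._attend(self.space, tokens.reshape(B * V * T, P, D))
        # Attend over time for each patch location in each view.
        x = x.reshape(B, V, T, P, D).permute(0, 1, 3, 2, 4).reshape(B * V * P, T, D)
        x = self._attend(self.time, x)
        # Attend across camera viewpoints for each patch location at each time step.
        x = x.reshape(B, V, P, T, D).permute(0, 3, 2, 1, 4).reshape(B * T * P, V, D)
        x = self._attend(self.view, x)
        # Restore the (B, V, T, P, D) layout.
        x = x.reshape(B, T, P, V, D).permute(0, 3, 1, 2, 4)
        return self.norm(x)
```

Factorizing attention along the three axes keeps the cost linear in each of V, T, and P rather than quadratic in their product; whether the actual model uses this ordering or a joint attention over all axes is an implementation detail not specified here.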