To align advanced artificial intelligence (AI) with human values and promote safe AI, it is important for AI to predict the outcomes of physical interactions. While debate continues on how humans predict the outcomes of physical interactions among objects in the real world, several works have attempted to tackle this task via cognition-inspired AI approaches. However, AI approaches that mimic the mental imagery humans use to predict physical interactions in the real world are still lacking. In this work, we propose PIP, a novel scheme for Physical Interaction Prediction via Mental Imagery with Span Selection. PIP utilizes a deep generative model to produce future frames of physical interactions among objects, then extracts crucial information for predicting those interactions by focusing on salient frames through span selection. To evaluate our model, we propose SPACE+, a large-scale dataset of synthetic video frames covering three physical interaction events in a 3D environment. Our experiments show that PIP outperforms both baselines and human performance in physical interaction prediction for seen and unseen objects. Furthermore, PIP's span selection scheme can effectively identify the frames within the generated sequence where physical interactions among objects occur, adding interpretability.
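To make the span-selection idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: it assumes per-frame features (e.g., from a CNN encoder over the generated future frames) and predicts start/end logits marking the salient span where an interaction occurs, analogous to span selection in question answering. All names (SpanSelector, feat_dim, etc.) and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SpanSelector(nn.Module):
    """Hypothetical span-selection head over a sequence of frame features.

    Frame features are contextualized with a small Transformer encoder,
    then per-frame start/end logits mark the salient span in which the
    physical interaction is predicted to occur.
    """

    def __init__(self, feat_dim: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Two logits per frame: one for span start, one for span end.
        self.span_head = nn.Linear(feat_dim, 2)

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (batch, n_frames, feat_dim)
        ctx = self.encoder(frame_feats)
        start_logits, end_logits = self.span_head(ctx).unbind(dim=-1)
        return start_logits, end_logits  # each (batch, n_frames)


if __name__ == "__main__":
    # Toy usage: 2 clips, 30 generated future frames, 256-d features each.
    feats = torch.randn(2, 30, 256)
    selector = SpanSelector()
    start, end = selector(feats)
    # The predicted salient span is given by the argmax start/end frames.
    print(start.argmax(dim=-1), end.argmax(dim=-1))
```

A classifier for the interaction outcome would then attend only to features inside the selected span, which is what makes the chosen frames inspectable and lends the method its interpretability.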