Learning long-term dynamics models is the key to understanding physical common sense. Most existing approaches to learning dynamics from visual input sidestep long-term prediction by resorting to rapid re-planning with short-term models. This not only requires such models to be highly accurate but also limits them to tasks where an agent can continuously obtain feedback and take an action at each step until completion. In this paper, we aim to leverage ideas from success stories in visual recognition to build object representations that can capture inter-object and object-environment interactions over a long range. To this end, we propose Region Proposal Interaction Networks (RPIN), which reason about each object's trajectory in a latent region-proposal feature space. Thanks to this simple yet effective object representation, our approach outperforms prior methods by a significant margin, both in prediction quality and in its ability to plan for downstream tasks, and it also generalizes well to novel environments. Code, pre-trained models, and more visualization results are available at https://haozhi.io/RPIN.
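To make the idea of reasoning about each object's trajectory in a latent feature space more concrete, the following is a minimal sketch of an interaction-style rollout over per-object feature vectors. It is not the authors' implementation: the function names (`interaction_step`, `rollout`), the random weights, the residual `tanh` update, and the random vectors standing in for region-proposal features are all assumptions made for illustration. Object-environment interactions and the decoding of features back to object positions are not modeled here.

```python
import math
import random


def linear(weights, vec):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def random_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical stand-in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def interaction_step(features, w_self, w_rel):
    """Advance every object's latent feature by one step.

    Each object receives a self term plus the sum of pairwise terms
    computed with every other object (inter-object interaction).
    """
    next_features = []
    for i, x_i in enumerate(features):
        update = linear(w_self, x_i)
        for j, x_j in enumerate(features):
            if i == j:
                continue
            # Concatenate the two object features and project them.
            pair = linear(w_rel, x_i + x_j)
            update = [u + p for u, p in zip(update, pair)]
        # Residual update with a bounded nonlinearity (an assumption).
        next_features.append([x + math.tanh(u) for x, u in zip(x_i, update)])
    return next_features


def rollout(features, w_self, w_rel, horizon):
    """Predict a latent trajectory by feeding each prediction back in."""
    trajectory = [features]
    for _ in range(horizon):
        features = interaction_step(features, w_self, w_rel)
        trajectory.append(features)
    return trajectory


rng = random.Random(0)
dim, num_objects, horizon = 4, 3, 5
w_self = random_matrix(dim, dim, rng)
w_rel = random_matrix(dim, 2 * dim, rng)
# Random vectors stand in for per-object region-proposal features.
objects = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_objects)]
trajectory = rollout(objects, w_self, w_rel, horizon)
print(len(trajectory), len(trajectory[0]), len(trajectory[0][0]))
```

Because the rollout feeds its own predictions back in, no new observation is needed after the initial features. This is the property that separates long-term prediction from rapid re-planning with a short-term model.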