Learning to manipulate 3D objects in an interactive environment has been a challenging problem in Reinforcement Learning (RL). In particular, it is hard to train a policy that can generalize over objects with different semantic categories, diverse shape geometries, and versatile functionalities. Recently, the technique of visual affordance has shown great promise in providing object-centric information priors with effective actionable semantics. With such priors, an effective policy can be trained to open a door by knowing how to exert force on its handle. However, learning affordances usually requires human-defined action primitives, which limits the range of applicable tasks. In this study, we take advantage of visual affordance by using the contact information generated during the RL training process to predict contact maps of interest. This contact prediction process then leads to an end-to-end affordance learning framework that can generalize over different types of manipulation tasks. Surprisingly, the effectiveness of this framework holds even under multi-stage and multi-agent scenarios. We tested our method on eight types of manipulation tasks. Results show that our method outperforms baseline algorithms, including visual-based affordance methods and RL methods, by a large margin in success rate. The demonstration can be found at https://sites.google.com/view/rlafford/.