Deep reinforcement learning has shown promising results on an abundance of robotic tasks in simulation, including visual navigation and manipulation. Prior work generally aims to build embodied agents that solve their assigned tasks as quickly as possible, while largely ignoring the problems caused by collision with objects during interaction. This lack of prioritization is understandable: there is no inherent cost in breaking virtual objects. As a result, "well-trained" agents frequently collide with objects before achieving their primary goals, a behavior that would be catastrophic in the real world. In this paper, we study the problem of training agents to complete the task of visual mobile manipulation in the ManipulaTHOR environment while avoiding unnecessary collision (disturbance) with objects. We formulate disturbance avoidance as a penalty term in the reward function, but find that directly training with such penalized rewards often results in agents being unable to escape poor local optima. Instead, we propose a two-stage training curriculum where an agent is first allowed to freely explore and build basic competencies without penalization, after which a disturbance penalty is introduced to refine the agent's behavior. Results on testing scenes show that our curriculum not only avoids these poor local optima, but also leads to 10% absolute gains in success rate without disturbance, compared to our state-of-the-art baselines. Moreover, our curriculum is significantly more performant than a safe RL algorithm that casts collision avoidance as a constraint. Finally, we propose a novel disturbance-prediction auxiliary task that accelerates learning.
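The reward shaping and two-stage curriculum described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the function name, the per-object penalty value, and the warm-up threshold are all hypothetical assumptions.

```python
def shaped_reward(base_reward: float,
                  num_disturbed_objects: int,
                  training_step: int,
                  warmup_steps: int = 1_000_000,
                  penalty: float = 0.5) -> float:
    """Sketch of a disturbance-penalized reward under a two-stage curriculum.

    Stage 1 (training_step < warmup_steps): the agent explores freely and
    builds basic competencies; no disturbance penalty is applied.
    Stage 2: a fixed penalty is subtracted per object disturbed in the step,
    refining the agent's behavior toward disturbance-free task completion.
    """
    if training_step < warmup_steps:
        return base_reward  # stage 1: unpenalized exploration
    return base_reward - penalty * num_disturbed_objects  # stage 2: penalized
```

Training directly with the stage-2 reward from the start is what the paper finds prone to poor local optima; the warm-up stage is the key difference.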