This paper deals with robotic lever control using Explainable Deep Reinforcement Learning. First, we train a policy using the Deep Deterministic Policy Gradient algorithm and the Hindsight Experience Replay technique, where the goal is to control a robotic manipulator to operate a lever. This enables us both to use continuous states and actions and to learn with sparse rewards. Being able to learn from sparse rewards is especially desirable for Deep Reinforcement Learning because designing a reward function for a complex task such as this is challenging. We first train in the PyBullet simulator, which accelerates the training procedure but is less accurate on this task than the real-world environment. After completing the training in PyBullet, we continue training in the Gazebo simulator, which runs more slowly than PyBullet but models the task more accurately. We then transfer the policy to the real-world environment, where it achieves performance comparable to the simulated environments for most episodes. To explain the decisions of the policy, we use the SHAP method to create an explanation model based on the episodes performed in the real-world environment. This gives us some results that agree with intuition and some that do not. We also question whether the independence assumption made when approximating the SHAP values affects their accuracy for a system such as this, where there are correlations between some of the states.
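As a rough illustration of the explanation step (not the paper's exact setup), the SHAP values for a policy's action decisions can be estimated with the Kernel SHAP explainer from the shap library. The sketch below assumes a hypothetical 6-dimensional state and 3-dimensional action, and uses a stand-in function in place of the trained DDPG actor; the background dataset is where the feature-independence assumption questioned above enters, since Kernel SHAP replaces "missing" features with background samples drawn independently of the remaining features.

```python
import numpy as np
import shap

# Hypothetical stand-in for the trained policy: maps a batch of state
# vectors (N, 6) to action vectors (N, 3). In the paper this role would be
# played by the DDPG actor network.
_rng = np.random.default_rng(0)
_W = _rng.standard_normal((6, 3))

def policy(states: np.ndarray) -> np.ndarray:
    return np.tanh(states @ _W)

# Background dataset of states (e.g., sampled from recorded real-world
# episodes). Kernel SHAP uses these samples to stand in for "missing"
# features, which implicitly assumes the state features are independent.
background = _rng.standard_normal((100, 6))

# States whose action decisions we want to explain.
states_to_explain = _rng.standard_normal((10, 6))

explainer = shap.KernelExplainer(policy, background)
# Depending on the shap version, the result is either a list of arrays
# (one per action dimension, each of shape (n_states, n_state_features))
# or a single stacked array.
shap_values = explainer.shap_values(states_to_explain)
```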