In this paper, we formulate inverse reinforcement learning (IRL) as an expert-learner interaction in which the optimal performance intent of an expert or target agent is unknown to a learner agent. The learner observes the expert's states and controls and seeks to reconstruct the expert's cost function intent, thereby mimicking the expert's optimal response. Next, we add non-cooperative disturbances that seek to disrupt the learning and stability of the learner agent. This leads to the formulation of a new interaction that we call zero-sum game IRL. We develop a framework for solving the zero-sum game IRL problem based on a modified extension of RL policy iteration (PI) that allows the unknown expert performance intent to be computed and non-cooperative disturbances to be rejected. The framework has two parts: a value function and control action update based on an extension of PI, and a cost function update based on standard inverse optimal control. We then develop an off-policy IRL algorithm that does not require knowledge of the expert and learner agent dynamics and performs single-loop learning. Rigorous proofs and analyses are given. Finally, simulation experiments are presented to show the effectiveness of the new approach.
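To fix ideas, the following is a minimal linear-quadratic sketch of the two-part structure described above. It is model-based (it assumes the learner knows the system matrices, the control weight R, and the attenuation level gamma, whereas the off-policy algorithm in the paper does not require the dynamics), all numerical values and the helper game_policy_iteration are illustrative, and the quadratic gain-mismatch correction of the state weight is one plausible inverse-optimal-control style update rather than the specific rule derived in the paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Hypothetical expert/learner dynamics: x_dot = A x + B u + D w (values illustrative).
A = np.array([[0.0, 1.0],
              [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
D = np.array([[0.0], [0.5]])
R = np.eye(1)                      # control weight, assumed known to the learner
gamma = 5.0                        # disturbance attenuation level, assumed known
Q_expert = np.diag([10.0, 1.0])    # expert's state weight: hidden from the learner


def game_policy_iteration(Q, n_iter=30):
    """Zero-sum-game policy iteration for a given state weight Q.

    Alternates policy evaluation (a Lyapunov equation for the closed loop that
    includes both the control and the disturbance policies) with simultaneous
    improvement of the control gain K and the worst-case disturbance gain L.
    """
    K = np.zeros((1, 2))           # initial stabilizing control policy (A is Hurwitz here)
    L = np.zeros((1, 2))           # initial disturbance policy
    for _ in range(n_iter):
        Ac = A - B @ K + D @ L
        rhs = -(Q + K.T @ R @ K - gamma**2 * (L.T @ L))
        P = solve_continuous_lyapunov(Ac.T, rhs)   # Ac' P + P Ac = rhs
        K = np.linalg.solve(R, B.T @ P)            # control improvement
        L = (D.T @ P) / gamma**2                   # disturbance improvement
    return P, K, L


# The expert's game-optimal gain; the learner observes only this behaviour,
# never Q_expert itself.
_, K_expert, _ = game_policy_iteration(Q_expert)

# Learner: interleave the policy/value update above with an inverse-optimal-control
# style correction of its state-weight estimate, driven by the mismatch between its
# own improved gain and the expert's observed gain.  Several weights can explain the
# same behaviour, so Q_hat need not equal Q_expert.
Q_hat = np.eye(2)
for _ in range(300):
    P, K, L = game_policy_iteration(Q_hat)
    Q_hat = Q_hat + (K_expert - K).T @ R @ (K_expert - K)

print("expert gain  :", np.round(K_expert, 3))
print("learner gain :", np.round(K, 3))
print("recovered state weight:\n", np.round(Q_hat, 3))
```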