Existing imitation learning methods mainly focus on making an agent mimic a demonstrated behavior effectively, but do not address the potential conflict between the style of that behavior and the objective of the task. There is a general lack of efficient methods that allow an agent to partially imitate a demonstrated behavior to varying degrees while still completing the main objective of a task. In this paper we propose a method called Regularized Soft Actor-Critic, which formulates the main task and the imitation task under the Constrained Markov Decision Process (CMDP) framework. The main task is defined as the maximum entropy objective used in Soft Actor-Critic (SAC), and the imitation task is defined as a constraint. We evaluate our method on continuous control tasks relevant to video game applications.
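To make the formulation concrete, the following is a minimal sketch of the constrained objective implied by the abstract, assuming a per-step imitation cost $c$ and a budget $d$ (illustrative symbols, not taken from the paper):

$$
\max_{\pi}\ \mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t} r(s_t,a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot\mid s_t)\big)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t} c(s_t,a_t)\right] \le d,
$$

where $r$ is the task reward and $\mathcal{H}$ the policy entropy with temperature $\alpha$ (together, the SAC maximum entropy objective), while $c$ measures deviation from the demonstrated behavior and the threshold $d$ sets how closely the agent must imitate. Varying $d$ would correspond to the varying degrees of partial imitation described above.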