Approaches based on generative adversarial networks for imitation learning are promising because they are sample-efficient in terms of expert demonstrations. However, training a generator requires many interactions with the actual environment because model-free reinforcement learning is used to update the policy. To improve sample efficiency using model-based reinforcement learning, we propose Model-Based Entropy-Regularized Imitation Learning (MB-ERIL) under the entropy-regularized Markov decision process, which reduces the number of interactions with the actual environment. MB-ERIL uses two discriminators: a policy discriminator distinguishes actions generated by the robot from expert ones, and a model discriminator distinguishes counterfactual state transitions generated by the model from actual ones. We derive structured discriminators so that learning the policy and the model is efficient. Computer simulations and real robot experiments show that MB-ERIL achieves competitive performance and significantly improves sample efficiency compared to baseline methods.
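To make the two-discriminator setup concrete, the following is a minimal sketch in PyTorch. It is not the paper's derived structured discriminators: the network sizes, the class names (PolicyDiscriminator, ModelDiscriminator), and the plain GAN-style binary cross-entropy objective are illustrative assumptions. Only the inputs follow the description above: state-action pairs for the policy discriminator and state-action-next-state triples for the model discriminator.

```python
# Minimal sketch of a two-discriminator adversarial imitation setup (assumptions:
# MLP architectures, BCE objective; not the structured discriminators of MB-ERIL).
import torch
import torch.nn as nn


class PolicyDiscriminator(nn.Module):
    """Scores (state, action) pairs: expert-like vs. policy-generated."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))


class ModelDiscriminator(nn.Module):
    """Scores (state, action, next_state) triples: real vs. model-generated."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, next_state) -> torch.Tensor:
        return self.net(torch.cat([state, action, next_state], dim=-1))


def discriminator_losses(pi_disc, model_disc,
                         expert_batch, policy_batch,
                         real_batch, model_batch):
    """GAN-style losses: expert/real data labeled 1, generated data labeled 0."""
    bce = nn.BCEWithLogitsLoss()
    pi_loss = (
        bce(pi_disc(*expert_batch), torch.ones(len(expert_batch[0]), 1))
        + bce(pi_disc(*policy_batch), torch.zeros(len(policy_batch[0]), 1))
    )
    model_loss = (
        bce(model_disc(*real_batch), torch.ones(len(real_batch[0]), 1))
        + bce(model_disc(*model_batch), torch.zeros(len(model_batch[0]), 1))
    )
    return pi_loss, model_loss


if __name__ == "__main__":
    # Toy dimensions and random batches, just to show the expected shapes.
    s_dim, a_dim, batch = 4, 2, 8
    d_pi = PolicyDiscriminator(s_dim, a_dim)
    d_m = ModelDiscriminator(s_dim, a_dim)
    expert = (torch.randn(batch, s_dim), torch.randn(batch, a_dim))
    policy = (torch.randn(batch, s_dim), torch.randn(batch, a_dim))
    real = (torch.randn(batch, s_dim), torch.randn(batch, a_dim), torch.randn(batch, s_dim))
    model = (torch.randn(batch, s_dim), torch.randn(batch, a_dim), torch.randn(batch, s_dim))
    print(discriminator_losses(d_pi, d_m, expert, policy, real, model))
```

This sketch covers only the discriminator side; in MB-ERIL the discriminators would additionally supply the learning signal used to update the policy and the dynamics model.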