Imitation learning (IL) is a framework for learning to imitate expert behavior from demonstrations. Recently, IL has shown promising results on high-dimensional control tasks. However, IL typically suffers from sample inefficiency in terms of environment interactions, which severely limits its application to simulated domains. In industrial applications, the learner usually faces a high interaction cost: the more it interacts with the environment, the more damage it causes to the environment and to itself. In this article, we make an effort to improve sample efficiency by introducing a novel inverse reinforcement learning scheme. Our method, which we call \textit{Model Reward Function Based Imitation Learning} (MRFIL), uses an ensemble dynamics model trained on expert demonstrations as the reward function. The key idea is to give the agent an incentive to match the demonstrations over a long horizon, by providing a positive reward upon encountering states that lie within the expert demonstration distribution. In addition, we prove a convergence guarantee for the new objective function. Experimental results show that our algorithm achieves competitive performance while significantly reducing environment interactions compared with IL methods.
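As a rough illustration of the key idea, the sketch below shows how an ensemble of dynamics models fitted to expert transitions can be turned into a reward signal that is positive only on transitions consistent with the expert distribution. It assumes a PyTorch setup; the class names, network sizes, and thresholding rule are hypothetical and are not the paper's exact formulation.

\begin{verbatim}
import torch
import torch.nn as nn


class DynamicsModel(nn.Module):
    """One ensemble member: predicts the next state from (state, action)."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def fit_on_expert(model, states, actions, next_states, epochs=100, lr=1e-3):
    """Fit one dynamics model on expert (s, a, s') transitions with MSE."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = ((model(states, actions) - next_states) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()


class EnsembleReward:
    """Reward derived from an ensemble trained only on expert demonstrations.

    A transition gets a positive reward when the ensemble predicts it well,
    i.e. when it looks like it came from the expert's distribution.
    The binary threshold rule here is an illustrative assumption.
    """

    def __init__(self, models, threshold=0.1):
        self.models = models
        self.threshold = threshold

    @torch.no_grad()
    def __call__(self, state, action, next_state):
        preds = torch.stack([m(state, action) for m in self.models])
        error = ((preds - next_state) ** 2).mean().item()
        return 1.0 if error < self.threshold else 0.0
\end{verbatim}

In such a setup, each ensemble member would be trained on the expert dataset, and the resulting \texttt{EnsembleReward} would replace the environment reward inside a standard RL algorithm, so that the policy is rewarded for staying on states the expert would visit.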