In complex, high-dimensional environments, training a reinforcement learning (RL) model from scratch often suffers from the lengthy and tedious collection of agent-environment interactions. Instead, leveraging expert demonstrations to guide the RL agent can boost sample efficiency and improve final convergence. To better integrate expert priors with on-policy RL models, we propose a generic framework for Learning from Demonstration (LfD) built on actor-critic algorithms. Technically, we first employ K-Means clustering to evaluate the similarity of sampled exploration data to the demonstration data. We then increase the likelihood of actions in similar frames by modifying the gradient-update strategy to leverage the demonstrations. We conduct experiments on 4 standard benchmark environments in MuJoCo and 2 self-designed robotic environments. Results show that, under certain conditions, our algorithm can improve sample efficiency by 20% to 40%. By combining our framework with on-policy algorithms, RL models can accelerate convergence and obtain higher final mean episode rewards, especially in complex robotic contexts where interactions are expensive.
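As a rough illustration of the similarity check described above, the sketch below fits K-Means on demonstration states and flags sampled frames that lie close to a cluster centre; the helper names (`fit_demo_clusters`, `similarity_mask`), the distance threshold, and the weighting scheme are hypothetical placeholders, not the paper's exact formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_demo_clusters(demo_states, n_clusters=32, seed=0):
    # Fit K-Means on demonstration states so each cluster centre summarises
    # a region of the state space covered by the expert.
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    km.fit(demo_states)
    return km

def similarity_mask(km, sampled_states, threshold):
    # Mark sampled states that fall close to any demonstration cluster centre.
    # Frames flagged here would receive a larger weight in the policy-gradient
    # step, increasing the likelihood of the corresponding actions.
    dists = km.transform(sampled_states)   # (N, n_clusters) distances to centres
    nearest = dists.min(axis=1)            # distance to the closest centre
    return nearest < threshold             # boolean mask over sampled frames

# One possible (assumed) way to fold the mask into a policy-gradient loss:
#   weights = np.where(mask, 1.0 + bonus, 1.0)
#   loss = -(weights * log_probs * advantages).mean()
```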