Deep reinforcement learning (DRL) provides a new way to generate robot control policies. However, training a control policy requires lengthy exploration, resulting in low sample efficiency of reinforcement learning (RL) in real-world tasks. Both imitation learning (IL) and learning from demonstrations (LfD) improve the training process by using expert demonstrations, but imperfect expert demonstrations can mislead policy improvement. Offline-to-online reinforcement learning requires a large amount of offline data to initialize the policy, and distribution shift can easily cause performance degradation during online fine-tuning. To address these problems, we propose a learning-from-demonstrations method named A-SILfD, which treats expert demonstrations as the agent's successful experiences and uses these experiences to constrain policy improvement. Furthermore, we prevent performance degradation caused by large estimation errors in the Q-function through ensemble Q-functions. Our experiments show that A-SILfD can significantly improve sample efficiency using a small number of expert demonstrations of varying quality. In four Mujoco continuous control tasks, A-SILfD significantly outperforms the baseline methods after 150,000 steps of online training and is not misled by imperfect expert demonstrations during training.
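For readers unfamiliar with the two components named above, the following Python sketch illustrates, under our own simplifying assumptions, how expert demonstrations can be stored in the same replay buffer as the agent's experiences and how an ensemble of Q-functions can yield a conservative update target. The class and function names (ReplayBuffer, seed_with_demonstrations, conservative_target) are illustrative and do not come from the paper.

```python
# A minimal sketch (not the authors' implementation) of two ideas named in the
# abstract: (1) seeding the replay buffer with expert demonstrations so they are
# replayed like the agent's own successful experiences, and (2) computing a
# conservative critic target from an ensemble of Q-functions to limit the effect
# of estimation errors.
import random
from collections import deque


class ReplayBuffer:
    """FIFO buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity: int = 100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size: int):
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))


def seed_with_demonstrations(buffer: ReplayBuffer, demonstrations):
    """Insert expert transitions as if they were the agent's own experiences."""
    for transition in demonstrations:
        buffer.add(transition)


def conservative_target(q_ensemble, policy, next_state, reward, done, gamma=0.99):
    """Bellman target for a continuous-control critic: evaluate every Q-function
    in the ensemble at the policy's next action and keep the minimum, a common
    way to curb overestimation (an assumption of this sketch, not a detail
    taken from the paper)."""
    next_action = policy(next_state)
    q_values = [q(next_state, next_action) for q in q_ensemble]
    return reward + gamma * (1.0 - float(done)) * min(q_values)
```

In this sketch, demonstration transitions and online transitions are sampled from the same buffer, so expert data influences updates exactly as the agent's own successful experiences would; the ensemble minimum is one simple choice of conservative aggregation.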