Recent advances in batch (offline) reinforcement learning have shown promising results in learning from available offline data and have established offline reinforcement learning as an essential tool for learning control policies in a model-free setting. An offline reinforcement learning algorithm applied to a dataset collected by a suboptimal, non-learning-based algorithm can yield a policy that outperforms the behavior agent used to collect the data. Such a scenario is common in robotics, where existing automation already collects operational data. Although offline learning techniques can learn from data generated by a suboptimal behavior agent, there remains an opportunity to improve the sample complexity of existing offline reinforcement learning algorithms by strategically introducing human demonstration data into the training process. To this end, we propose a novel approach that uses uncertainty estimation to trigger the injection of human demonstration data and to guide policy training toward optimal behavior while reducing overall sample complexity. Our experiments show that this approach is more sample efficient than naively combining expert data with data collected from a suboptimal agent. We augmented an existing offline reinforcement learning algorithm, Conservative Q-Learning (CQL), with our approach and performed experiments on data collected from the MuJoCo and OffWorld Gym learning environments.
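To make the uncertainty-triggered injection mechanism concrete, the sketch below illustrates one plausible way such a scheme could be wired into an offline training loop: batches are normally drawn from the suboptimal-agent dataset, and when an epistemic-uncertainty proxy (here, disagreement across a Q-function ensemble) exceeds a threshold, a batch of human demonstrations is drawn instead. This is a minimal illustration under assumed design choices; the ensemble-variance measure, the threshold value, and all names (ReplayBuffer, select_training_batch, q_ensemble) are assumptions for exposition, not the paper's implementation.

```python
# Illustrative sketch only: uncertainty-gated injection of human demonstration
# batches into an offline RL training loop. The uncertainty measure (ensemble
# std. deviation), the threshold, and all identifiers are assumptions.
import numpy as np

rng = np.random.default_rng(0)

class ReplayBuffer:
    """Minimal buffer of (state, action, reward, next_state) transitions."""
    def __init__(self, states, actions, rewards, next_states):
        self.states, self.actions = states, actions
        self.rewards, self.next_states = rewards, next_states

    def sample(self, batch_size):
        idx = rng.integers(0, len(self.states), size=batch_size)
        return (self.states[idx], self.actions[idx],
                self.rewards[idx], self.next_states[idx])

def ensemble_uncertainty(q_ensemble, states, actions):
    """Mean std. deviation of Q-estimates across the ensemble (epistemic proxy)."""
    q_values = np.stack([q(states, actions) for q in q_ensemble])  # (n_ensemble, batch)
    return float(q_values.std(axis=0).mean())

def select_training_batch(q_ensemble, agent_buffer, demo_buffer,
                          batch_size=256, threshold=0.5):
    """Sample from the suboptimal-agent data; fall back to human demonstrations
    when the ensemble disagrees strongly on the sampled batch."""
    batch = agent_buffer.sample(batch_size)
    if ensemble_uncertainty(q_ensemble, batch[0], batch[1]) > threshold:
        batch = demo_buffer.sample(batch_size)
    return batch  # the selected batch would then feed a CQL-style update

# Toy usage: random linear Q-functions standing in for a trained ensemble.
dim_s, dim_a, n = 4, 2, 1000
make_buffer = lambda: ReplayBuffer(rng.normal(size=(n, dim_s)),
                                   rng.normal(size=(n, dim_a)),
                                   rng.normal(size=n),
                                   rng.normal(size=(n, dim_s)))
q_ensemble = [
    (lambda w: (lambda s, a: np.concatenate([s, a], axis=1) @ w))(
        rng.normal(size=dim_s + dim_a))
    for _ in range(5)
]
batch = select_training_batch(q_ensemble, make_buffer(), make_buffer())
```

In this sketch the demonstration data is only consumed when the learner is uncertain, which is one way the approach could reduce how many expert samples are needed compared to naively mixing the two datasets throughout training.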