Reinforcement learning (RL) provides an appealing formalism for learning control policies from experience. However, the classic active formulation of RL necessitates a lengthy active exploration process for each behavior, making it difficult to apply in real-world settings such as robotic control. If we could instead allow RL algorithms to effectively use previously collected data to aid the online learning process, such applications would become substantially more practical: the prior data would provide a starting point that mitigates challenges due to exploration and sample complexity, while the online training enables the agent to perfect the desired skill. Such prior data could consist of expert demonstrations or, more generally, sub-optimal data that illustrates potentially useful transitions. Yet it remains difficult to train a policy with potentially sub-optimal offline data and then improve it further with online RL. In this paper, we systematically analyze why this problem is so challenging and propose an algorithm that combines sample-efficient dynamic programming with maximum-likelihood policy updates, providing a simple and effective framework that can leverage large amounts of offline data and then quickly perform online fine-tuning of RL policies. We show that our method, advantage weighted actor critic (AWAC), enables rapid learning of skills from a combination of prior demonstration data and online experience.
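To make the "dynamic programming plus maximum-likelihood policy updates" idea concrete, below is a minimal sketch of an advantage-weighted actor loss of the kind the abstract describes, assuming PyTorch. The names `policy`, `q_net`, `LAMBDA`, and the batch variables are hypothetical stand-ins, and the specific baseline and temperature choices follow the common advantage-weighted regression form rather than the paper's exact implementation.

```python
# Minimal sketch of an advantage-weighted, maximum-likelihood actor update
# (assumes PyTorch; `policy`, `q_net`, and the batch tensors are hypothetical stand-ins).
import torch

LAMBDA = 1.0  # temperature of the exponential advantage weighting (assumed value)

def awac_actor_loss(policy, q_net, states, actions):
    """Weighted maximum likelihood: log pi(a|s) weighted by exp(A(s,a) / lambda)."""
    dist = policy(states)                        # action distribution pi(.|s)
    log_prob = dist.log_prob(actions)            # log-likelihood of offline/replay actions
    with torch.no_grad():
        q_data = q_net(states, actions)          # Q(s, a) for actions stored in the buffer
        q_pi = q_net(states, dist.sample())      # baseline V(s) ~ Q(s, a'), a' ~ pi(.|s)
        advantage = q_data - q_pi
        weights = torch.exp(advantage / LAMBDA)  # exponential advantage weights
    return -(weights * log_prob).mean()          # maximize the weighted log-likelihood
```

In this sketch the critic `q_net` would be trained with ordinary TD bootstrapping on both the offline dataset and newly collected transitions, so the same objective can be used for offline pre-training and subsequent online fine-tuning.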