Reinforcement learning (RL) provides an appealing formalism for learning control policies from experience. However, the classic active formulation of RL necessitates a lengthy active exploration process for each behavior, making it difficult to apply in real-world settings such as robotic control. If we can instead allow RL algorithms to effectively use previously collected data to aid the online learning process, such applications could be made substantially more practical: the prior data would provide a starting point that mitigates challenges due to exploration and sample complexity, while the online training enables the agent to perfect the desired skill. Such prior data could consist of expert demonstrations or, more generally, sub-optimal data that illustrates potentially useful transitions. However, it remains difficult to train a policy with potentially sub-optimal offline data and then further improve it with online RL. In this paper we systematically analyze why this problem is so challenging, and propose an algorithm that combines sample-efficient dynamic programming with maximum likelihood policy updates, providing a simple and effective framework that is able to leverage large amounts of offline data and then quickly perform online fine-tuning of RL policies. We show that our method, advantage weighted actor critic (AWAC), enables rapid learning of skills with a combination of prior demonstration data and online experience. We demonstrate these benefits on a variety of simulated and real-world robotics domains, including dexterous manipulation with a real multi-fingered hand, drawer opening with a robotic arm, and rotating a valve. Our results show that incorporating prior data can reduce the time required to learn a range of robotic skills to practical time-scales.
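To make the core update concrete, below is a minimal PyTorch-style sketch of the advantage-weighted maximum likelihood policy step that gives AWAC its name. The `policy` and `critic` interfaces, the temperature `lam`, and the weight clipping are illustrative assumptions rather than the paper's exact implementation; the critic itself would be trained separately with standard temporal-difference (Bellman) backups on the combined offline and online data.

```python
import torch

def awac_actor_loss(policy, critic, states, actions, lam=1.0, max_weight=100.0):
    """Advantage-weighted maximum likelihood policy update (illustrative sketch).

    Assumptions (not the paper's exact code): `policy(states)` returns a
    torch.distributions.Distribution over full action vectors, so .log_prob
    yields one value per state; `critic(states, actions)` returns a Q-value
    tensor of shape [batch].
    """
    with torch.no_grad():
        # Approximate V(s) with the Q-value of an action sampled from the
        # current policy, giving the advantage A(s, a) = Q(s, a) - V(s).
        sampled_actions = policy(states).sample()
        v = critic(states, sampled_actions)
        q = critic(states, actions)
        # Exponentiated advantages act as per-sample weights; clipping keeps
        # the update numerically stable.
        weights = torch.clamp(torch.exp((q - v) / lam), max=max_weight)

    # Weighted maximum likelihood: increase the likelihood of dataset/replay
    # actions in proportion to their estimated advantage.
    log_prob = policy(states).log_prob(actions)
    return -(log_prob * weights).mean()
```

Roughly speaking, weighting the likelihood of dataset actions by their exponentiated advantage lets the policy improve on the behavior in the data while implicitly staying close to it, which is what allows the same update to serve both offline pre-training and subsequent online fine-tuning.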