Reinforcement learning (RL) provides an appealing formalism for learning control policies from experience. However, the classic active formulation of RL necessitates a lengthy active exploration process for each behavior, making it difficult to apply in real-world settings such as robotic control. If we can instead allow RL algorithms to effectively use previously collected data to aid the online learning process, such applications could be made substantially more practical: the prior data would provide a starting point that mitigates challenges due to exploration and sample complexity, while the online training enables the agent to perfect the desired skill. Such prior data could consist of expert demonstrations or sub-optimal data that illustrates potentially useful transitions. While a number of prior methods have either used optimal demonstrations to bootstrap RL, or have used sub-optimal data to train purely offline, it remains exceptionally difficult to train a policy with offline data and then continue to improve it with online RL. In this paper we analyze why this problem is so challenging, and propose an algorithm that combines sample-efficient dynamic programming with maximum likelihood policy updates, providing a simple and effective framework that is able to leverage large amounts of offline data and then quickly perform online fine-tuning of RL policies. We show that our method, advantage weighted actor critic (AWAC), enables rapid learning of skills with a combination of prior demonstration data and online experience. We demonstrate these benefits on simulated and real-world robotics domains, including dexterous manipulation with a real multi-fingered hand, drawer opening with a robotic arm, and rotating a valve. Our results show that incorporating prior data can reduce the time required to learn a range of robotic skills to practical time-scales.
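To make the algorithmic core concrete, the following is a minimal sketch of the two updates named in the abstract: a critic trained by sample-efficient dynamic programming (a TD backup) and an actor trained by maximum-likelihood updates weighted by the exponentiated advantage. It assumes PyTorch; the network architectures, the temperature `lam`, and other hyperparameters are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of AWAC-style updates; names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, lam, gamma = 8, 2, 1.0, 0.99

# Critic Q(s, a) and a Gaussian policy pi(a | s) with a state-independent std, for brevity.
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))
q_target = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))
q_target.load_state_dict(q_net.state_dict())  # sync periodically (e.g. Polyak averaging)
policy_mean = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
log_std = torch.zeros(act_dim, requires_grad=True)

q_opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)
pi_opt = torch.optim.Adam(list(policy_mean.parameters()) + [log_std], lr=3e-4)


def awac_update(batch):
    """One gradient step on a batch of (s, a, r, s', done) transitions,
    drawn from offline data and/or the online replay buffer."""
    s, a, r, s2, done = batch

    # Critic: dynamic-programming (TD) backup toward the bootstrapped target.
    with torch.no_grad():
        a2 = torch.distributions.Normal(policy_mean(s2), log_std.exp()).sample()
        target = r + gamma * (1 - done) * q_target(torch.cat([s2, a2], -1)).squeeze(-1)
    q = q_net(torch.cat([s, a], -1)).squeeze(-1)
    q_loss = F.mse_loss(q, target)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Actor: maximum-likelihood update on dataset actions, weighted by exp(advantage / lam).
    dist = torch.distributions.Normal(policy_mean(s), log_std.exp())
    with torch.no_grad():
        a_pi = dist.sample()
        v = q_net(torch.cat([s, a_pi], -1)).squeeze(-1)        # V(s) estimate under the policy
        adv = q_net(torch.cat([s, a], -1)).squeeze(-1) - v     # A(s, a)
        weights = torch.exp(adv / lam)
    pi_loss = -(dist.log_prob(a).sum(-1) * weights).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```

Because the actor loss is a weighted log-likelihood over actions already in the buffer, the same update applies unchanged to purely offline data and to online fine-tuning; only the contents of the replay buffer change.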