Recent advances in deep offline reinforcement learning (RL) have made it possible to train strong robotic agents from offline datasets. However, depending on the quality of the trained agents and the application being considered, it is often desirable to fine-tune such agents via further online interactions. In this paper, we observe that state-action distribution shift may lead to severe bootstrap error during fine-tuning, which destroys the good initial policy obtained via offline RL. To address this issue, we first propose a balanced replay scheme that prioritizes samples encountered online while also encouraging the use of near-on-policy samples from the offline dataset. Furthermore, we leverage multiple Q-functions trained pessimistically offline, thereby preventing overoptimism concerning unfamiliar actions at novel states during the initial training phase. We show that the proposed method improves the sample efficiency and final performance of the fine-tuned robotic agents on various locomotion and manipulation tasks. Our code is available at: https://github.com/shlee94/Off2OnRL.
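To make the two ingredients above concrete, the sketch below illustrates (i) a balanced replay buffer that samples online transitions and near-on-policy offline transitions with priority proportional to an externally estimated online-vs-offline density ratio, and (ii) a pessimistic value estimate computed from an ensemble of Q-functions. This is a minimal illustration under assumed names (`BalancedReplay`, `pessimistic_q`, the `density_ratio` argument, the `beta` penalty coefficient), not the released Off2OnRL implementation; see the repository above for the authors' code.

```python
# Illustrative sketch only (assumed API, not the released Off2OnRL code).
import numpy as np


class BalancedReplay:
    """Replay buffer that favors online and near-on-policy offline samples."""

    def __init__(self, online_priority=1.0, min_priority=1e-3):
        self.data = []  # list of (transition, priority) pairs
        self.online_priority = online_priority
        self.min_priority = min_priority

    def add_offline(self, transition, density_ratio):
        # density_ratio approximates d_online(s, a) / d_offline(s, a);
        # how it is estimated is outside the scope of this sketch.
        self.data.append((transition, max(density_ratio, self.min_priority)))

    def add_online(self, transition):
        # Online transitions always receive the (high) reference priority.
        self.data.append((transition, self.online_priority))

    def sample(self, batch_size, rng=None):
        # Sample transitions with probability proportional to priority.
        rng = rng or np.random.default_rng()
        priorities = np.array([p for _, p in self.data])
        probs = priorities / priorities.sum()
        idx = rng.choice(len(self.data), size=batch_size, p=probs)
        return [self.data[i][0] for i in idx]


def pessimistic_q(q_values, beta=1.0):
    """One common pessimistic aggregate over an ensemble of Q-estimates:
    mean minus a multiple of the ensemble standard deviation.
    (The paper's exact aggregation may differ.)"""
    q_values = np.asarray(q_values, dtype=np.float64)
    return q_values.mean(axis=0) - beta * q_values.std(axis=0)
```

In this sketch, the offline dataset would be inserted once with estimated density ratios, while newly collected transitions are added via `add_online`; sampling then naturally shifts toward online data as it accumulates, which is the qualitative behavior the balanced replay scheme is meant to achieve.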