A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback. Often, what is available is an intuitive but sparse reward function that only indicates whether the task is partially or fully completed. However, the lack of carefully designed, fine-grained feedback implies that most existing RL algorithms fail to learn an acceptable policy in a reasonable time frame, because of the large number of exploratory actions the policy has to perform before it receives any useful feedback to learn from. In this work, we address this challenging problem by developing an algorithm that exploits offline demonstration data generated by a sub-optimal behavior policy for fast and efficient online RL in such sparse-reward settings. The proposed algorithm, which we call the Learning Online with Guidance Offline (LOGO) algorithm, merges a policy improvement step with an additional policy guidance step that uses the offline demonstration data. The key idea is that, by obtaining guidance from, rather than imitating, the offline data, LOGO orients its policy toward the sub-optimal policy while still being able to learn beyond it and approach optimality. We provide a theoretical analysis of our algorithm and derive a lower bound on the performance improvement in each learning episode. We also extend our algorithm to the even more challenging incomplete-observation setting, where the demonstration data contains only a censored version of the true state observation. We demonstrate the superior performance of our algorithm over state-of-the-art approaches on a number of benchmark environments with sparse rewards and censored state observations. Further, we demonstrate the value of our approach by implementing LOGO on a mobile robot for trajectory tracking and obstacle avoidance, where it shows excellent performance.
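To make the high-level description concrete, the following is a minimal conceptual sketch (our own illustration under simplifying assumptions, not the paper's implementation): a tabular softmax policy on a toy sparse-reward chain MDP is trained by alternating a standard policy-improvement step with a guidance step that reduces the KL divergence to a behavior policy estimated from demonstrations, with a guidance weight that decays over iterations. The chain MDP, step sizes, and decay schedule are illustrative assumptions.

```python
# Conceptual sketch of guidance-from-demonstrations (illustrative, not the authors' code):
# alternate (i) a policy-improvement step on the sparse environment reward with
# (ii) a policy-guidance step pulling the policy toward a demonstration-derived
# behavior policy, with guidance that decays as learning progresses.
import numpy as np

rng = np.random.default_rng(0)
N, A, H = 10, 2, 30                      # chain length, actions (0=left, 1=right), horizon

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    s2 = min(s + 1, N - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N - 1 else 0.0)   # sparse reward only at the goal state

def rollout(theta):
    s, traj, ret = 0, [], 0.0
    for _ in range(H):
        a = rng.choice(A, p=softmax(theta[s]))
        s2, r = step(s, a)
        traj.append((s, a))
        ret += r
        s = s2
    return traj, ret

# Behavior policy estimated from sub-optimal demonstrations (assumed: mostly moves right).
demo_policy = np.tile([0.2, 0.8], (N, 1))

theta = np.zeros((N, A))
alpha, beta0 = 0.5, 0.5                  # improvement / guidance step sizes (assumed)
for k in range(200):
    traj, ret = rollout(theta)
    # (i) policy-improvement step: REINFORCE on the sparse episode return
    for (s, a) in traj:
        grad = -softmax(theta[s])
        grad[a] += 1.0                   # grad of log pi(a|s) for a softmax policy
        theta[s] += alpha * ret * grad
    # (ii) policy-guidance step: descend KL(pi_theta || pi_behavior) on visited states,
    # with a decaying weight so guidance fades once sparse rewards start to inform learning
    beta = beta0 / (1.0 + 0.05 * k)
    for (s, _) in traj:
        p = softmax(theta[s])
        log_ratio = np.log(p / demo_policy[s])
        kl_grad = p * (log_ratio - p @ log_ratio)   # exact KL gradient w.r.t. softmax logits
        theta[s] -= beta * kl_grad
```

In this toy setting the guidance term gives the policy useful direction long before the sparse goal reward is ever observed, while the decaying weight lets the improvement step eventually dominate; the actual LOGO algorithm realizes this idea with trust-region policy updates and a shrinking guidance trust region, as detailed in the paper.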