A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback. Often, what is available is an intuitive but sparse reward function that only indicates whether the task is partially or fully completed. The lack of carefully designed, fine-grained feedback implies that most existing RL algorithms fail to learn an acceptable policy in a reasonable time frame, because of the large number of exploratory actions the policy must take before it receives any useful feedback to learn from. In this work, we address this challenging problem by developing an algorithm that exploits offline demonstration data generated by a sub-optimal behavior policy for faster and more efficient online RL in such sparse reward settings. The proposed algorithm, which we call Learning Online with Guidance Offline (LOGO), merges a policy improvement step with an additional policy guidance step that uses the offline demonstration data. The key idea is that, by obtaining guidance from the offline data rather than imitating it, LOGO orients its policy in the manner of the sub-optimal behavior policy while still being able to learn beyond it and approach optimality. We provide a theoretical analysis of our algorithm and establish a lower bound on the performance improvement in each learning episode. We also extend our algorithm to the even more challenging incomplete observation setting, in which the demonstration data contains only a censored version of the true state observation. We demonstrate the superior performance of our algorithm over state-of-the-art approaches on a number of benchmark environments with sparse rewards and censored state observations. Further, we demonstrate the value of our approach by implementing LOGO on a mobile robot for trajectory tracking and obstacle avoidance, where it shows excellent performance.
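To make the description above concrete, the sketch below illustrates the general idea of alternating a policy improvement step on the sparse online reward with a guidance step that pulls the policy toward a behavior policy estimated from offline demonstrations, using a decaying guidance weight. This is a simplified, hypothetical illustration under assumptions made here for clarity, not the paper's actual LOGO implementation: the chain MDP, the REINFORCE-style improvement step, and all hyperparameters are placeholders chosen only to keep the example self-contained.

```python
# Minimal sketch (assumptions only, not the paper's exact algorithm): alternate
# (i) a policy-improvement step driven by a sparse reward with (ii) a guidance
# step toward a behavior policy estimated from offline demonstrations, with a
# decaying guidance weight so the learner can eventually outgrow the demonstrator.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, HORIZON = 10, 2, 20   # sparse-reward chain: +1 only at the last state

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == N_STATES - 1 else 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Offline demonstrations from a sub-optimal behavior policy (moves right 80% of the time).
demos = []
for _ in range(50):
    s = 0
    for _ in range(HORIZON):
        a = 1 if rng.random() < 0.8 else 0
        demos.append((s, a))
        s, _ = step(s, a)

# Empirical estimate of the behavior policy from the demonstration data (Laplace smoothing).
counts = np.ones((N_STATES, N_ACTIONS))
for s, a in demos:
    counts[s, a] += 1
pi_b = counts / counts.sum(axis=1, keepdims=True)

logits = np.zeros((N_STATES, N_ACTIONS))         # tabular softmax policy
lr_improve, lr_guide, delta = 0.1, 2.0, 1.0      # illustrative hyperparameters

for k in range(200):
    # (i) Policy-improvement step: REINFORCE on the sparse return.
    grad = np.zeros_like(logits)
    for _ in range(10):                          # small batch of online rollouts
        s, traj, ret = 0, [], 0.0
        for _ in range(HORIZON):
            p = softmax(logits[s])
            a = rng.choice(N_ACTIONS, p=p)
            traj.append((s, a, p))
            s, r = step(s, a)
            ret += r
        for s_t, a_t, p_t in traj:               # grad log pi(a|s) = onehot(a) - pi(.|s)
            g = -p_t.copy()
            g[a_t] += 1.0
            grad[s_t] += ret * g
    logits += lr_improve * grad / 10

    # (ii) Guidance step: pull pi(.|s) toward the estimated behavior policy pi_b(.|s)
    # on demonstration states by descending KL(pi || pi_b); the guidance weight delta
    # is decayed so the learned policy can eventually surpass the sub-optimal pi_b.
    guide_grad = np.zeros_like(logits)
    for s, _ in demos:
        p = softmax(logits[s])
        log_ratio = np.log(p + 1e-8) - np.log(pi_b[s] + 1e-8)
        guide_grad[s] += p * (log_ratio - np.sum(p * log_ratio))   # d KL / d logits[s]
    logits -= lr_guide * delta * guide_grad / len(demos)
    delta *= 0.99                                # decay the guidance as learning progresses
```

Early in training, when the sparse reward provides almost no signal, the guidance term dominates and steers exploration toward the demonstrator's behavior; as the guidance weight decays, the improvement step takes over and can refine the policy beyond the sub-optimal demonstrations.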