Despite seminal advances in reinforcement learning in recent years, many domains where the rewards are sparse, e.g., given only at task completion, remain quite challenging. In such cases, it can be beneficial to tackle the task from both its beginning and its end, and make the two ends meet. Existing approaches that do so, however, are not effective in the common scenario where the strategy needed near the end goal is very different from the one that is effective earlier on. In this work we propose a novel RL approach for such settings. In short, we first train a backward-looking agent with a simple relaxed goal, and then augment the state representation of the forward-looking agent with straightforward hint features. This allows the learned forward agent to leverage information from backward plans, without mimicking their policy. We demonstrate the efficacy of our approach on the challenging game of Sokoban, where we substantially surpass learned solvers that generalize across levels, and are competitive with the SOTA performance of the best highly-crafted systems. Impressively, we achieve these results while learning from a small number of practice levels and using simple RL techniques.
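To make the core idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of how hint features derived from backward plans could be appended to the forward agent's observation; the names HintAugmentedEnv, backward_plans, and env.state are illustrative assumptions rather than anything defined in the paper.

# A minimal, hypothetical sketch of the general idea (not the paper's code):
# the forward agent's observation is augmented with "hint" features derived
# from trajectories produced by a separately trained backward agent.
# Names such as HintAugmentedEnv, backward_plans, and env.state are assumptions.
import numpy as np


def hint_features(state, backward_states):
    """Toy hint: a single binary flag marking whether the current state was
    visited by any backward plan. Richer hints (e.g., per-cell visit maps)
    would follow the same pattern."""
    return np.array([1.0 if state in backward_states else 0.0], dtype=np.float32)


class HintAugmentedEnv:
    """Wraps an environment so the forward agent observes (obs, hints).

    The forward policy is still trained with ordinary RL; the backward plans
    only contribute extra input features, never target actions to imitate."""

    def __init__(self, env, backward_plans):
        self.env = env
        # Collect all states touched by the backward agent's plans.
        self.backward_states = {s for plan in backward_plans for s in plan}

    def _augment(self, obs):
        return np.concatenate([obs, hint_features(self.env.state, self.backward_states)])

    def reset(self):
        return self._augment(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._augment(obs), reward, done, info

In this toy setup the forward agent never sees the backward agent's actions, only features summarizing where its plans went, which matches the abstract's point that the forward policy leverages backward information without mimicking it.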