A key challenge for reinforcement learning is solving long-horizon planning problems. Recent work has leveraged programs to guide reinforcement learning in these settings. However, these approaches impose a high manual burden on the user, who must provide a guiding program for every new task. Partially observed environments further complicate the programming task, because the program must implement a strategy that correctly, and ideally optimally, handles every possible configuration of the hidden regions of the environment. We propose a new approach, model predictive program synthesis (MPPS), that uses program synthesis to automatically generate the guiding programs. It trains a generative model to predict the unobserved portions of the world, and then synthesizes a program based on samples from this model in a way that is robust to its uncertainty. In our experiments, we show that our approach significantly outperforms non-program-guided approaches on a set of challenging benchmarks, including a 2D Minecraft-inspired environment where the agent must complete a complex sequence of subtasks to achieve its goal, and achieves performance comparable to using handcrafted programs to guide the agent. Our results demonstrate that our approach can obtain the benefits of program-guided reinforcement learning without requiring the user to provide a new guiding program for every new task.
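To make the planning loop concrete, below is a minimal Python sketch of one MPPS-style planning step, assuming a generative model of the hidden regions and a space of candidate guiding programs. All names (sample_world, candidate_programs, reward) are illustrative placeholders, and the enumeration over a fixed candidate set stands in for the paper's actual program synthesizer; this is a sketch of the idea, not the authors' implementation.

```python
from typing import Callable, Sequence

def mpps_plan(
    observation,                   # the agent's partial view of the environment
    sample_world: Callable,        # generative model: partial observation -> sampled full world
    candidate_programs: Sequence,  # candidate guiding programs (stand-in for a synthesizer)
    reward: Callable,              # simulated reward of a program on a sampled world
    num_samples: int = 16,
):
    """One MPPS-style planning step: sample completions of the unobserved
    regions, then choose the guiding program that performs best on average
    across the samples, so the choice is robust to model uncertainty."""
    # 1. Hallucinate plausible completions of the hidden parts of the world.
    worlds = [sample_world(observation) for _ in range(num_samples)]
    # 2. Pick the program that maximizes expected reward over the samples,
    #    rather than optimizing for any single imagined world.
    return max(
        candidate_programs,
        key=lambda p: sum(reward(p, w) for w in worlds) / num_samples,
    )
```

The "model predictive" naming suggests that, as in model predictive control, this step would be repeated as new observations reveal more of the environment, with the current synthesized program guiding the low-level reinforcement learning policy in between.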