First-order relational languages have been used in MDP planning and reinforcement learning (RL) for two main purposes: specifying MDPs in compact form, and representing and learning policies that are general and not tied to specific instances or state spaces. In this work, we instead consider the use of first-order languages in goal-conditioned RL and generalized planning. The question is how to learn goal-conditioned, general policies when the training instances are large and the goals cannot be reached by random exploration alone. The technique of Hindsight Experience Replay (HER) provides an answer: it relabels unsuccessful trajectories as successful ones by replacing the original goal with one that was actually achieved. When the target policy must generalize across states and goals, trajectories that fail to reach the original goal can still be exploited, enabling more data- and time-efficient learning. We show that further performance gains can be achieved when states and goals are represented as sets of atoms, and we consider three versions of this relabeling: goals as full states, goals as subsets of the atoms in the original goal, and goals as lifted versions of these subgoals. The latter two versions successfully learn general policies on large planning instances with sparse rewards by automatically creating a curriculum of easier goals of increasing complexity. The experiments illustrate the computational gains of these versions, their limitations, and opportunities for addressing them.
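The relabeling step described above can be sketched concretely. The following is a minimal illustration, not the paper's implementation: it assumes states and goals are frozensets of ground atoms such as ("on", "a", "b"), that a trajectory is a list of such states, and that the three variants are selected by a flag; the names relabel and lift are hypothetical.

```python
# Sketch of HER-style relabeling when states and goals are sets of ground atoms.
# Assumptions: a state/goal is a frozenset of tuples ("pred", "obj1", ...),
# a trajectory is a list of states; names and formats are illustrative only.
import random


def lift(atoms):
    """Replace object names with variables (?x0, ?x1, ...) consistently."""
    mapping = {}
    lifted = []
    for pred, *args in sorted(atoms):
        new_args = []
        for obj in args:
            if obj not in mapping:
                mapping[obj] = f"?x{len(mapping)}"
            new_args.append(mapping[obj])
        lifted.append((pred, *new_args))
    return frozenset(lifted)


def relabel(trajectory, original_goal, version):
    """Return a relabeled goal for an unsuccessful trajectory.

    version: 'full-state' | 'subgoal' | 'lifted-subgoal'.
    """
    final_state = trajectory[-1]
    if version == "full-state":
        # Goal = the full state actually reached (plain HER-style relabeling).
        return final_state
    # Atoms of the original goal that actually hold in the final state.
    achieved = final_state & original_goal
    if not achieved:
        # Nothing of the original goal was achieved; keep the original goal.
        return original_goal
    # Goal = a non-empty achieved subset of the original goal atoms.
    pool = sorted(achieved)
    subgoal = frozenset(random.sample(pool, k=random.randint(1, len(pool))))
    if version == "subgoal":
        return subgoal
    # version == 'lifted-subgoal': abstract away the object names.
    return lift(subgoal)
```

Under these assumptions, the lifted variant turns a ground subgoal such as {("on", "a", "b")} into {("on", "?x0", "?x1")}, which is what lets the relabeled experience transfer across objects and instances.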