Two common approaches to sequential decision-making are AI planning (AIP) and reinforcement learning (RL). Each has strengths and weaknesses. AIP is interpretable, easy to integrate with symbolic knowledge, and often efficient, but it requires an up-front logical domain specification and is sensitive to noise; RL only requires specification of rewards and is robust to noise, but it is sample inefficient and cannot easily be supplied with external knowledge. We propose an integrative approach that combines high-level planning with RL, retaining interpretability, transfer, and efficiency, while allowing for robust learning of the lower-level plan actions. Our approach defines options in hierarchical reinforcement learning (HRL) from AIP operators by establishing a correspondence between the state transition model of an AI planning problem and the abstract state transition system of a Markov Decision Process (MDP). Options are learned by adding intrinsic rewards that encourage consistency between the MDP and AIP transition models. We demonstrate the benefit of our integrated approach by comparing the performance of RL and HRL algorithms in both MiniGrid and N-rooms environments, showing the advantage of our method over existing ones.
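To make the correspondence concrete, the following is a minimal sketch (not the paper's implementation) of how a planning operator could induce an option: the operator's preconditions define the initiation set, its effects define the termination condition, and an intrinsic reward is given when the low-level policy produces the abstract transition the operator prescribes. The `Operator`/`Option` classes and the symbolic facts are hypothetical, illustrative names.

```python
# Sketch: deriving an HRL option from a STRIPS-style planning operator and
# shaping it with an intrinsic reward for consistency with the AIP model.
# All names here are illustrative assumptions, not the paper's code.

from dataclasses import dataclass


@dataclass
class Operator:
    """A STRIPS-style planning operator with preconditions and effects."""
    name: str
    preconditions: frozenset          # symbolic facts required to initiate
    add_effects: frozenset            # facts that must hold on termination
    delete_effects: frozenset = frozenset()


@dataclass
class Option:
    """An HRL option induced by a planning operator."""
    operator: Operator

    def can_initiate(self, abstract_state: frozenset) -> bool:
        # Initiation set: abstract states satisfying the preconditions.
        return self.operator.preconditions <= abstract_state

    def is_terminal(self, abstract_state: frozenset) -> bool:
        # Termination: the operator's intended abstract effect is achieved.
        return (self.operator.add_effects <= abstract_state
                and not (self.operator.delete_effects & abstract_state))

    def intrinsic_reward(self, next_abstract_state: frozenset) -> float:
        # Reward consistency between the MDP transition and the AIP model:
        # +1 when the low-level policy reaches the operator's intended
        # abstract successor state, 0 otherwise (a simple sparse choice).
        return 1.0 if self.is_terminal(next_abstract_state) else 0.0


# Example: an operator for opening a door in a MiniGrid-like domain.
open_door = Operator(
    name="open-door",
    preconditions=frozenset({"at-door", "has-key"}),
    add_effects=frozenset({"door-open"}),
)
option = Option(open_door)
print(option.can_initiate(frozenset({"at-door", "has-key"})))         # True
print(option.intrinsic_reward(frozenset({"door-open", "has-key"})))   # 1.0
```

In this sketch the intrinsic reward is sparse (paid only at termination); a denser shaping signal over intermediate abstract states would be an alternative design choice under the same correspondence.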