Training reinforcement learning (RL) agents using scalar reward signals is often infeasible when an environment has sparse and non-Markovian rewards. Moreover, handcrafting these reward functions before training is prone to misspecification, especially when the environment's dynamics are only partially known. This paper proposes a novel pipeline for learning non-Markovian task specifications as succinct finite-state `task automata' from episodes of agent experience within unknown environments. We leverage two key algorithmic insights. First, we learn a product MDP, a model composed of the specification's automaton and the environment's MDP (both initially unknown), by treating it as a partially observable MDP and using off-the-shelf algorithms for hidden Markov models. Second, we propose a novel method for distilling the task automaton (assumed to be a deterministic finite automaton) from the learnt product MDP. Our learnt task automaton enables the decomposition of a task into its constituent sub-tasks, which improves the rate at which an RL agent can later synthesise an optimal policy. It also provides an interpretable encoding of high-level environmental and task features, so a human can readily verify that the agent has learnt coherent tasks with no misspecifications. In addition, we take steps towards ensuring that the learnt automaton is environment-agnostic, making it well-suited for use in transfer learning. Finally, we provide experimental results to illustrate our algorithm's performance in different environments and tasks and its ability to incorporate prior domain knowledge to facilitate more efficient learning.
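To make the two algorithmic insights concrete, the sketch below is a minimal, illustrative toy version of the pipeline, not the paper's actual algorithm. It assumes the numpy and hmmlearn packages and the hypothetical helpers learn_product_model and distil_dfa: the first step fits an off-the-shelf hidden Markov model to episode label sequences, treating the states of the unknown product MDP as hidden states; the second step crudely distils a deterministic transition table over the label alphabet from Viterbi-decoded traces.

```python
import numpy as np
from hmmlearn.hmm import CategoricalHMM  # assumed off-the-shelf HMM library, not from the paper

def learn_product_model(episodes, n_states, seed=0):
    """Fit an HMM whose hidden states stand in for the states of the unknown
    product MDP; `episodes` is a list of 1-D integer arrays of observed labels."""
    X = np.concatenate(episodes).reshape(-1, 1)
    lengths = [len(ep) for ep in episodes]
    model = CategoricalHMM(n_components=n_states, n_iter=200, random_state=seed)
    model.fit(X, lengths)
    return model

def distil_dfa(model, episodes):
    """Toy distillation step: Viterbi-decode each episode, then, for every
    (hidden state, observed label) pair, keep the most frequent successor as
    the DFA transition. Returns a dict mapping (state, label) -> next state."""
    counts = {}
    for ep in episodes:
        states = model.predict(ep.reshape(-1, 1))
        for t in range(len(ep) - 1):
            key = (int(states[t]), int(ep[t]))
            succ = counts.setdefault(key, {})
            succ[int(states[t + 1])] = succ.get(int(states[t + 1]), 0) + 1
    return {key: max(succ, key=succ.get) for key, succ in counts.items()}
```

This sketch omits what the abstract highlights as essential to the full method: separating the automaton's states from the environment's states so that the distilled task automaton is environment-agnostic and suitable for transfer learning.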