Incorporating high-level knowledge is an effective way to expedite reinforcement learning (RL), especially for complex tasks with sparse rewards. We investigate an RL problem in which the high-level knowledge takes the form of reward machines, i.e., a type of Mealy machine that encodes reward functions. We focus on a setting in which this knowledge is not a priori available to the learning agent. We develop an iterative algorithm that performs joint inference of reward machines and policies for RL (more specifically, q-learning). In each iteration, the algorithm maintains a hypothesis reward machine and a sample of RL episodes. It derives q-functions from the current hypothesis reward machine and performs RL to update the q-functions. While performing RL, the algorithm updates the sample by adding RL episodes along which the obtained rewards are inconsistent with the rewards predicted by the current hypothesis reward machine. In the next iteration, the algorithm infers a new hypothesis reward machine from the updated sample. Based on an equivalence relation we define between states of reward machines, we transfer the q-functions between the hypothesis reward machines of consecutive iterations. We prove that the proposed algorithm converges almost surely to an optimal policy in the limit, provided that a minimal reward machine can be inferred and the maximal length of each RL episode is sufficiently long. The experiments show that learning high-level knowledge in the form of reward machines leads to fast convergence to optimal policies, whereas standard RL methods such as q-learning and hierarchical RL methods fail to converge to optimal policies after a substantial number of training steps in many tasks.
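To make the iterative structure described above concrete, the following is a minimal sketch of the joint-inference loop in Python. All helper names (infer_reward_machine, q_learning_episode, rm_reward_sequence, transfer_q_functions) are hypothetical placeholders standing in for the components the abstract mentions; they are not the authors' actual API, and the loop omits details such as exploration schedules and termination criteria.

```python
# Hypothetical sketch of the joint inference of reward machines and policies.
# Helper callables are assumptions, passed in as parameters so the loop itself is self-contained.

from typing import Callable, Dict, List, Tuple

Trace = Tuple[tuple, tuple]  # (label sequence, reward sequence) observed along one RL episode


def joint_inference_rl(
    infer_reward_machine: Callable[[List[Trace]], object],        # infers a (minimal) RM consistent with the sample
    q_learning_episode: Callable[[object, Dict], Trace],          # runs one q-learning episode guided by the hypothesis RM
    rm_reward_sequence: Callable[[object, tuple], tuple],         # rewards the hypothesis RM assigns to a label sequence
    transfer_q_functions: Callable[[object, object, Dict], Dict], # q-transfer via the equivalence between RM states
    num_iterations: int = 10,
    episodes_per_iteration: int = 1000,
) -> Tuple[object, Dict]:
    sample: List[Trace] = []                       # episodes whose rewards contradict the current hypothesis
    hypothesis_rm = infer_reward_machine(sample)   # initial (trivial) hypothesis reward machine
    q_functions: Dict = {}                         # one q-function per state of the hypothesis RM

    for _ in range(num_iterations):
        for _ in range(episodes_per_iteration):
            labels, rewards = q_learning_episode(hypothesis_rm, q_functions)
            # Add the episode to the sample if the obtained rewards are
            # inconsistent with what the current hypothesis RM predicts.
            if rm_reward_sequence(hypothesis_rm, labels) != rewards:
                sample.append((labels, rewards))

        if sample:
            # Infer a new hypothesis RM from the updated sample, then carry
            # q-functions over between equivalent states of the old and new RM.
            new_rm = infer_reward_machine(sample)
            q_functions = transfer_q_functions(hypothesis_rm, new_rm, q_functions)
            hypothesis_rm = new_rm

    return hypothesis_rm, q_functions
```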