Reinforcement learning (RL) methods usually treat reward functions as black boxes. As such, these methods must extensively interact with the environment in order to discover rewards and optimal policies. In most RL applications, however, users have to program the reward function and, hence, there is the opportunity to make the reward function visible -- to show the reward function's code to the RL agent so it can exploit the function's internal structure to learn optimal policies in a more sample-efficient manner. In this paper, we show how to accomplish this idea in two steps. First, we propose reward machines, a type of finite state machine that supports the specification of reward functions while exposing reward function structure. We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning. Experiments on tabular and continuous domains, across different tasks and RL agents, show the benefits of exploiting reward structure with respect to sample efficiency and the quality of resultant policies. Finally, by virtue of being a form of finite state machine, reward machines have the expressive power of a regular language and as such support loops, sequences, and conditionals, as well as the expression of temporally extended properties typical of linear temporal logic and non-Markovian reward specification.
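To make the idea concrete, the following is a minimal Python sketch of a reward machine viewed as a finite-state transducer: given the machine's current state and the set of propositions that hold at the current environment step, it returns a next machine state and a reward. The class name, the transition encoding, and the coffee-delivery task are illustrative assumptions for this sketch, not the paper's implementation or API.

```python
# Minimal sketch of a reward machine as a finite-state transducer.
# Names and encoding are illustrative, not the paper's code.
from typing import Dict, FrozenSet, Tuple

class RewardMachine:
    """Maps (machine state, label) -> (next machine state, reward)."""

    def __init__(self, u0: str,
                 delta: Dict[Tuple[str, FrozenSet[str]], Tuple[str, float]]):
        self.u0 = u0        # initial machine state
        self.delta = delta  # combined state-transition and reward function
        self.u = u0         # current machine state

    def step(self, label: FrozenSet[str]) -> float:
        """Advance on the propositions observed this step; return the reward.
        Unlisted (state, label) pairs self-loop with zero reward."""
        self.u, reward = self.delta.get((self.u, label), (self.u, 0.0))
        return reward

    def reset(self) -> None:
        self.u = self.u0


# Hypothetical task: "get coffee, then deliver it to the office"
# pays reward 1 only when the full sequence is completed.
rm = RewardMachine(
    u0="u0",
    delta={
        ("u0", frozenset({"coffee"})): ("u1", 0.0),
        ("u1", frozenset({"office"})): ("u_acc", 1.0),
    },
)
print(rm.step(frozenset({"coffee"})))  # 0.0 -> machine moves to u1
print(rm.step(frozenset({"office"})))  # 1.0 -> task complete
```

Because the machine's states and transitions are visible to the learner, an RL agent can, for example, shape rewards over the machine's states or learn a separate policy per machine state, which is the kind of structure exploitation the paper evaluates.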