In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted-reward formulation. As in other settings, learning an optimal policy here typically requires a large amount of training experience. Reward shaping is a common approach for incorporating domain knowledge into reinforcement learning in order to speed up convergence to an optimal policy. However, to the best of our knowledge, the theoretical properties of reward shaping have thus far only been established in the discounted setting. This paper presents the first reward shaping framework for average-reward learning and proves that, under standard assumptions, the optimal policy under the original reward function can be recovered. To avoid manually constructing the shaping function, we introduce a method that utilizes domain knowledge expressed as a temporal logic formula. The formula is automatically translated to a shaping function that provides additional reward throughout the learning process. We evaluate the proposed method on three continuing tasks. In all cases, shaping speeds up average-reward learning without reducing the performance of the learned policy relative to relevant baselines.
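To make the shaping idea concrete, below is a minimal sketch, not the paper's construction: potential-based shaping added to a tabular differential Q-learning agent on a toy continuing chain. The environment, the hand-crafted potential function phi, and all hyperparameters are illustrative assumptions; the paper instead derives the shaping function automatically from a temporal logic formula.

```python
# Minimal sketch (illustrative only): potential-based reward shaping applied to
# tabular differential Q-learning on a toy continuing MDP.
import numpy as np

n_states, n_actions = 5, 2
rng = np.random.default_rng(0)

def step(s, a):
    """Toy continuing dynamics: action 1 moves right; the last state pays reward."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r

def phi(s):
    """Hand-crafted potential encoding domain knowledge: larger nearer the goal."""
    return s / (n_states - 1)

Q = np.zeros((n_states, n_actions))
r_bar = 0.0                      # running estimate of the average reward
alpha, eta, eps = 0.1, 0.1, 0.1  # step sizes and exploration rate (assumed values)

s = 0
for t in range(20000):
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    shaped_r = r + phi(s_next) - phi(s)           # potential-based shaping term
    delta = shaped_r - r_bar + Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * delta                      # differential Q-learning update
    r_bar += eta * alpha * delta                  # average-reward estimate update
    s = s_next

print("Greedy policy:", Q.argmax(axis=1))
```

The shaping term phi(s_next) - phi(s) supplies intermediate reward on the way to the goal state, which is the kind of additional signal the shaping function in the paper provides throughout learning.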