In reinforcement learning (RL), the goal is to obtain an optimal policy, for which the choice of optimality criterion is fundamentally important. Two major optimality criteria are the average reward and the discounted reward. While the latter is more popular, it is problematic to apply in environments that have no inherent notion of discounting. This motivates us to revisit a) the progression of optimality criteria in dynamic programming, b) the justification for and complications of an artificial discount factor, and c) the benefits of directly maximizing the average-reward criterion, which is discounting-free. Our contributions include a thorough examination of the relationship between the average and discounted rewards, as well as a discussion of their pros and cons in RL. We emphasize that average-reward RL methods possess the ingredients and mechanisms needed to apply a family of discounting-free optimality criteria (Veinott, 1969) to RL.
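For reference, the two criteria contrasted above are standardly defined as follows (a minimal sketch in our own notation, assuming an infinite-horizon MDP with a stationary policy \pi, reward r_t at step t, initial state s, and discount factor \gamma \in [0, 1); the symbols are ours and not necessarily those used in the body of the paper):

\[
v_\gamma^\pi(s) \;=\; \mathbb{E}_\pi\!\left[\,\sum_{t=0}^{\infty} \gamma^t\, r_t \;\middle|\; s_0 = s\right]
\quad \text{(discounted reward)},
\]
\[
g^\pi(s) \;=\; \lim_{T \to \infty} \frac{1}{T}\,
\mathbb{E}_\pi\!\left[\,\sum_{t=0}^{T-1} r_t \;\middle|\; s_0 = s\right]
\quad \text{(average reward, i.e. gain)}.
\]

The average-reward (gain) criterion involves no discount factor, which is why it is referred to as discounting-free.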