In reinforcement learning (RL), the goal is to obtain an optimal policy, for which the optimality criterion is fundamentally important. Two major optimality criteria are the average and discounted rewards, where the latter is typically regarded as an approximation to the former. While the discounted reward is more popular, it is problematic to apply in environments that have no natural notion of discounting. This motivates us to revisit a) the progression of optimality criteria in dynamic programming, b) the justification for and complications of an artificial discount factor, and c) the benefits of directly maximizing the average reward. Our contributions include a thorough examination of the relationship between average and discounted rewards, as well as a discussion of their pros and cons in RL. We emphasize that average-reward RL methods possess the ingredients and mechanisms for developing the general discounting-free optimality criterion (Veinott, 1969) in RL.
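For reference, the two criteria contrasted above can be written in their standard textbook form; this is a generic formulation (the symbols $v_\gamma^\pi$, $g^\pi$, $r_{t+1}$, and $s_0$ are notational assumptions, not taken from the paper):
\[
v_\gamma^\pi(s) \;=\; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, r_{t+1} \;\middle|\; s_0 = s\right], \quad \gamma \in [0,1),
\qquad
g^\pi(s) \;=\; \lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}_\pi\!\left[\sum_{t=0}^{n-1} r_{t+1} \;\middle|\; s_0 = s\right],
\]
where $v_\gamma^\pi$ is the discounted value of policy $\pi$ with discount factor $\gamma$, and $g^\pi$ is its average reward (gain); the approximation noted above reflects that, under suitable conditions, $(1-\gamma)\, v_\gamma^\pi(s) \to g^\pi(s)$ as $\gamma \to 1$.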