Any reinforcement learning system must be able to identify which past events contributed to observed outcomes, a problem known as credit assignment. A common solution to this problem is to use an eligibility trace to assign credit to a recency-weighted set of recently experienced events. However, in many realistic tasks, the recently experienced events are only one of many possible sequences of events that could have preceded the current outcome. This suggests that reinforcement learning can be made more efficient by allowing credit assignment to any viable preceding state, rather than only to those most recently experienced. Accordingly, we propose "Predecessor Features", an algorithm that achieves this richer form of credit assignment. By maintaining a representation that approximates the expected sum of past occupancies, our algorithm allows temporal difference (TD) errors to be propagated accurately to a larger number of predecessor states than conventional methods, greatly improving learning speed. Our algorithm can also be naturally extended from a tabular state representation to feature representations, allowing for improved performance on a wide range of environments. We demonstrate several use cases for Predecessor Features and contrast its performance with that of other, similar approaches.
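The sketch below illustrates, in the tabular case, the general idea the abstract describes: learning an expected, recency-weighted predecessor representation and using it in place of the instantaneous eligibility trace when propagating TD errors. It is a minimal illustration under assumed hyperparameters and class/variable names (PredecessorTraceAgent, alpha_p, etc.), not the paper's reference implementation or exact update rules.

```python
import numpy as np

class PredecessorTraceAgent:
    """Tabular TD evaluation with an expected predecessor representation."""

    def __init__(self, n_states, gamma=0.99, lam=0.9,
                 alpha_v=0.1, alpha_p=0.1):
        self.gamma, self.lam = gamma, lam
        self.alpha_v, self.alpha_p = alpha_v, alpha_p
        self.V = np.zeros(n_states)                 # state-value estimates
        # P[s] approximates the expected discounted sum of past state
        # occupancies (the "predecessor representation") given arrival in s.
        self.P = np.zeros((n_states, n_states))
        self.trace = np.zeros(n_states)             # per-episode trace

    def start_episode(self):
        self.trace[:] = 0.0

    def update(self, s, r, s_next, done):
        # Standard accumulating eligibility trace for the current trajectory.
        self.trace *= self.gamma * self.lam
        self.trace[s] += 1.0
        # Move P[s] toward the trace observed on arrival at s, so that P[s]
        # estimates the *expected* recency-weighted predecessors of s rather
        # than only those seen on this particular trajectory.
        self.P[s] += self.alpha_p * (self.trace - self.P[s])
        # TD error at the current transition.
        target = r + (0.0 if done else self.gamma * self.V[s_next])
        delta = target - self.V[s]
        # Propagate the TD error to all expected predecessor states at once,
        # rather than only to states visited in this episode.
        self.V += self.alpha_v * delta * self.P[s]
```

In this sketch, credit for a TD error at state s flows to every state that tends to precede s under the behavior experienced so far, which is the broader credit assignment the abstract contrasts with conventional trace-based methods.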