Any reinforcement learning system must be able to identify which past events contributed to observed outcomes, a problem known as credit assignment. A common solution is to use an eligibility trace, which assigns credit to a recency-weighted set of experienced events. However, in many realistic tasks, the recently experienced events are only one of many possible sequences of events that could have preceded the current outcome. This suggests that reinforcement learning can be made more efficient by allowing credit assignment to any viable preceding state, rather than only those most recently experienced. Accordingly, we examine ``Predecessor Features'', the fully bootstrapped version of van Hasselt's ``Expected Trace'', an algorithm that achieves this richer form of credit assignment. By maintaining a representation that approximates the expected sum of past occupancies, the algorithm allows temporal difference (TD) errors to be propagated accurately to a much larger set of predecessor states than conventional methods, greatly improving learning speed. The algorithm also extends naturally from tabular state representations to feature representations, allowing for increased performance on a wide range of environments. We demonstrate several use cases for Predecessor Features and compare its performance with other approaches.
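As a minimal sketch of the idea, following van Hasselt's expected-trace formulation and assuming linear value estimates $v(s) = w^\top x(s)$ (the symbols $x$, $z$, $\alpha$, and $\beta$ are our own illustrative notation, not necessarily the paper's): conventional TD($\lambda$) updates the weights with a recency-weighted trace of visited features,
\[
e_t = \gamma\lambda\, e_{t-1} + x(S_t), \qquad w_{t+1} = w_t + \alpha\,\delta_t\, e_t,
\]
whereas the expected-trace view replaces the sampled trace with a learned representation of a state's expected predecessors,
\[
w_{t+1} = w_t + \alpha\,\delta_t\, z(S_t), \qquad z(S_t) \approx \mathbb{E}\!\left[\, e_t \mid S_t \,\right],
\]
and the fully bootstrapped version learns $z$ itself by a TD-style update,
\[
z(S_t) \leftarrow z(S_t) + \beta\!\left( x(S_t) + \gamma\lambda\, z(S_{t-1}) - z(S_t) \right).
\]
Because $z(S_t)$ summarizes all likely predecessors of $S_t$ rather than only the states visited on the current trajectory, a single TD error can assign credit to states that were not recently experienced.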