The question of how to determine which states and actions are responsible for a certain outcome is known as the credit assignment problem and remains a central research question in reinforcement learning and artificial intelligence. Eligibility traces enable efficient credit assignment to the recent sequence of states and actions experienced by the agent, but not to counterfactual sequences that could also have led to the current state. In this work, we introduce expected eligibility traces. Expected traces allow a single update to assign credit to states and actions that could have preceded the current state, even if they did not do so on this occasion. We discuss when expected traces provide benefits over classic (instantaneous) traces in temporal-difference learning, and show that substantial improvements can sometimes be attained. We provide a way to smoothly interpolate between instantaneous and expected traces by a mechanism similar to bootstrapping, which ensures that the resulting algorithm is a strict generalisation of TD($\lambda$). Finally, we discuss possible extensions and connections to related ideas, such as successor features.
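To make these notions concrete, the following is one possible formalisation, reconstructed from the description above rather than taken from the body of the paper; the mixing parameter $\eta$ and the expected-trace symbol $z$ are introduced here purely for illustration. Let $e_t$ denote the usual accumulating trace of TD($\lambda$) over value-function gradients. The expected trace conditions this quantity on the current state, and a bootstrapping-like parameter $\eta$ interpolates between the two:
\begin{align*}
e_t &= \gamma\lambda\, e_{t-1} + \nabla_{\theta} v_{\theta}(S_t), \\
z(s) &\doteq \mathbb{E}\!\left[\, e_t \mid S_t = s \,\right], \\
y_t &= \eta\, z(S_t) + (1 - \eta)\!\left( \gamma\lambda\, y_{t-1} + \nabla_{\theta} v_{\theta}(S_t) \right),
\end{align*}
with the value parameters updated as $\theta_{t+1} = \theta_t + \alpha\, \delta_t\, y_t$, where $\delta_t = R_{t+1} + \gamma v_{\theta}(S_{t+1}) - v_{\theta}(S_t)$ is the usual temporal-difference error. Under this sketch, setting $\eta = 0$ recovers TD($\lambda$) exactly, consistent with the strict-generalisation claim, while $\eta = 1$ updates with the learned expectation $z(S_t)$, so that a single update assigns credit to all states that are likely predecessors of $S_t$, not only those visited on the current trajectory.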