Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When the agent lands in a state, its value can be used to compute the TD-error, which is then propagated to other states. However, when computing updates, it may be useful to take into account information beyond whether a state was visited or not. For example, some states might be more important than others (such as states which are frequently seen in successful trajectories). Or, some states might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach to re-weighting the states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.
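As a rough illustration of the re-weighting idea only (not the exact update rule analysed in the paper), the sketch below applies a hypothetical per-state weight beta(s) in a linear TD(0) loop: the weight scales both how strongly a visited state is corrected (input weighting) and how much its value contributes as a bootstrap target (target weighting). The function name weighted_td0, the feature map phi, and the toy random-walk setup are illustrative assumptions, not artifacts of the paper.

```python
import numpy as np

def weighted_td0(transitions, phi, beta, alpha=0.05, gamma=0.99, n_features=8):
    """Illustrative TD(0) pass with linear function approximation, where a
    per-state weight beta(s) in [0, 1] scales both (a) how strongly state s
    is updated when visited and (b) how much s' contributes as a target.

    transitions: iterable of (s, r, s_next, done) tuples
    phi:  feature map, phi(s) -> np.ndarray of shape (n_features,)
    beta: weight function, beta(s) -> float in [0, 1]
    """
    w = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        x, x_next = phi(s), phi(s_next)
        v, v_next = x @ w, x_next @ w
        # Target weighting: discount the bootstrap value of s_next by
        # beta(s_next), so states with unreliable estimates contribute less.
        target = r + (0.0 if done else gamma * beta(s_next) * v_next)
        td_error = target - v
        # Input weighting: scale the correction at s by beta(s), so less
        # important states receive smaller updates.
        w += alpha * beta(s) * td_error * x
    return w


if __name__ == "__main__":
    # Toy usage: 5-state random walk with one-hot features and a hand-picked
    # beta that down-weights the two middle states (purely for illustration).
    rng = np.random.default_rng(0)
    n_states = 5
    phi = lambda s: np.eye(n_states)[s]
    beta = lambda s: 0.5 if s in (2, 3) else 1.0

    transitions, s = [], 2
    for _ in range(5000):
        s_next = s + rng.choice([-1, 1])
        done = s_next in (-1, n_states)          # fell off either end
        r = 1.0 if s_next == n_states else 0.0   # reward only at right end
        transitions.append((s, r, min(max(s_next, 0), n_states - 1), done))
        s = 2 if done else s_next

    w = weighted_td0(transitions, phi, beta,
                     alpha=0.05, gamma=1.0, n_features=n_states)
    print(np.round(w, 2))
```

With beta set to 1 everywhere this reduces to ordinary TD(0); the down-weighted middle states are only there to show how the two weighting roles enter the update.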