We study the link between generalization and interference in temporal-difference (TD) learning. Interference is defined as the inner product of the gradients computed at two different samples, a measure of their alignment. This quantity emerges as being of interest from a variety of observations about neural networks, parameter sharing, and the dynamics of learning. We find that TD easily leads to low-interference, under-generalizing parameters, while the effect seems reversed in supervised learning. We hypothesize that the cause can be traced back to the interplay between the dynamics of interference and bootstrapping. This is supported empirically by several observations: the negative relationship between the generalization gap and interference in TD, the negative effect of bootstrapping on interference and on the local coherence of targets, and the contrast between the rate at which information propagates in TD(0) versus TD($\lambda$) and in regression tasks such as Monte-Carlo policy evaluation. We hope that these new findings can guide the future discovery of better bootstrapping methods.
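To make the central quantity concrete, the sketch below (an illustration for this summary, not code from the paper) computes the interference between two transitions as the inner product of their TD(0) semi-gradients, here under the simplifying assumption of a linear value function $V(s) = \theta^\top \phi(s)$; the function names and the linear parameterization are assumptions of this example.

```python
import numpy as np

def td0_semi_gradient(theta, phi_s, phi_s_next, reward, gamma):
    """Semi-gradient of the TD(0) loss 0.5 * delta^2 for a linear value
    function V(s) = theta . phi(s); the bootstrap target is held fixed."""
    delta = reward + gamma * theta @ phi_s_next - theta @ phi_s
    return -delta * phi_s  # gradient w.r.t. theta with the target treated as a constant

def interference(theta, transition_i, transition_j, gamma=0.99):
    """Interference between two transitions: the inner product of their
    semi-gradients. Positive values mean the two updates are aligned
    (one update also improves the other); negative values mean they conflict."""
    g_i = td0_semi_gradient(theta, *transition_i, gamma)
    g_j = td0_semi_gradient(theta, *transition_j, gamma)
    return g_i @ g_j

# Tiny usage example with random features; transitions are (phi(s), phi(s'), r).
rng = np.random.default_rng(0)
theta = rng.normal(size=4)
t_i = (rng.normal(size=4), rng.normal(size=4), 1.0)
t_j = (rng.normal(size=4), rng.normal(size=4), 0.0)
print(interference(theta, t_i, t_j))
```

With a nonlinear function approximator the same quantity would be computed from the gradients of the TD loss with respect to the network parameters, e.g. via automatic differentiation.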