Value approximation using deep neural networks is at the heart of off-policy deep reinforcement learning, and is often the primary module that provides learning signals to the rest of the algorithm. While multi-layer perceptron networks are universal function approximators, recent work on neural kernel regression suggests the presence of a spectral bias, where fitting high-frequency components of the value function requires exponentially more gradient update steps than fitting the low-frequency ones. In this work, we re-examine off-policy reinforcement learning through the lens of kernel regression and propose to overcome such bias via a composite neural tangent kernel. With just a single line-change, our approach, Fourier feature networks (FFN), produces state-of-the-art performance on challenging continuous control domains with only a fraction of the compute. Faster convergence and better off-policy stability also make it possible to remove the target network without suffering catastrophic divergence, which further reduces TD(0)'s estimation bias on a few tasks.
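To make the "single line-change" concrete, the sketch below shows a Q-network whose first layer is replaced by a Fourier feature mapping (a linear projection followed by a cosine) before the usual MLP. This is a minimal illustration of the general idea, not the paper's exact implementation; the class name, layer sizes, and the specific form of the feature mapping are assumptions made for the example.

```python
# Minimal sketch (illustrative, not the paper's code): a Q-network where the
# ordinary first linear layer is wrapped in a cosine to form Fourier features.
import torch
import torch.nn as nn


class FourierFeatureQNetwork(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256, fourier_dim=256):
        super().__init__()
        # Learnable projection whose output is passed through cos(.) to
        # produce Fourier features of the state-action input.
        self.fourier = nn.Linear(obs_dim + act_dim, fourier_dim)
        self.mlp = nn.Sequential(
            nn.Linear(fourier_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        x = torch.cos(self.fourier(x))  # the "single line-change" vs. a plain MLP first layer
        return self.mlp(x)


# Hypothetical usage with made-up dimensions:
# q = FourierFeatureQNetwork(obs_dim=17, act_dim=6)
# q_value = q(torch.randn(32, 17), torch.randn(32, 6))
```

Compared with a standard MLP critic, the only structural difference is the cosine applied to the first layer's output, which is what makes the change a one-line edit in practice.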