Many reinforcement learning algorithms rely on value estimation. However, the most widely used algorithms, namely temporal difference algorithms, can diverge under both off-policy sampling and nonlinear function approximation. Many algorithms have been developed for off-policy value estimation based on the linear mean-squared projected Bellman error (PBE), and these are sound under linear function approximation. Extending these methods to the nonlinear case has been largely unsuccessful. Recently, several methods have been introduced that approximate a different objective, called the mean-squared Bellman error (BE), which naturally facilitates nonlinear approximation. In this work, we build on these insights and introduce a new generalized PBE that extends the linear PBE to the nonlinear setting. We show how this generalized objective unifies previous work, including previous theory, and obtain new bounds for the value error of the solutions of the generalized objective. We derive an easy-to-use but sound algorithm for minimizing the generalized objective; the algorithm is more stable across runs, is less sensitive to hyperparameters, and performs favorably across four control domains with neural network function approximation.