We present and analyze the Krylov-Bellman Boosting (KBB) algorithm for policy evaluation in general state spaces. It alternates between two steps: fitting the Bellman residual using non-parametric regression (as in boosting), and estimating the value function via the least-squares temporal difference (LSTD) procedure applied with a feature set that grows adaptively over time. By exploiting the connection to Krylov methods, we equip this method with two attractive guarantees. First, we provide a general convergence bound that allows for separate estimation errors in residual fitting and LSTD computation. Consistent with our numerical experiments, this bound shows that convergence rates depend on the restricted spectral structure, and are typically super-linear. Second, by combining this meta-result with sample-size dependent guarantees for residual fitting and LSTD computation, we obtain concrete statistical guarantees that depend on the sample size along with the complexity of the function class used to fit the residuals. We illustrate the behavior of the KBB algorithm for various types of policy evaluation problems, and typically find large reductions in sample complexity relative to the standard approach of fitted value iteration.
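The alternation described above can be illustrated with a minimal, noise-free sketch on a finite-state Markov reward process. This is not the authors' implementation: it assumes the transition matrix `P` and reward vector `r` are known, so the Bellman residual is computed exactly rather than fitted by non-parametric regression, and the LSTD step is solved at the population level rather than from samples. The names `kbb_exact` and `lazy_cycle_chain`, the tolerance `tol`, and the weighting vector `mu` are all hypothetical choices made for illustration.

```python
import numpy as np


def lazy_cycle_chain(n):
    """Hypothetical test chain: lazy random walk on a cycle of n states.

    The matrix is symmetric, so the uniform distribution is stationary.
    """
    P = np.zeros((n, n))
    for i in range(n):
        P[i, i] = 0.5
        P[i, (i - 1) % n] += 0.25
        P[i, (i + 1) % n] += 0.25
    return P


def kbb_exact(P, r, gamma, num_rounds, mu=None, tol=1e-10):
    """Noise-free sketch of the residual-fitting / LSTD alternation.

    Assumption: `mu` is the stationary distribution of P (uniform if None),
    used as the weighting in both the residual norm and the LSTD equations.
    Returns the final value estimate and the weighted residual norm
    observed at the start of each round.
    """
    n = len(r)
    mu = np.full(n, 1.0 / n) if mu is None else np.asarray(mu, dtype=float)
    V = np.zeros(n)
    basis = []
    residual_norms = []

    for _ in range(num_rounds):
        # Step 1: the Bellman residual of the current estimate. In the
        # sample-based setting this is where a regression fit would go.
        residual = r + gamma * (P @ V) - V
        norm = np.sqrt(mu @ residual**2)
        residual_norms.append(norm)
        if norm < tol:
            break

        # Grow the feature set with the (normalized) residual.
        basis.append(residual / norm)
        Phi = np.column_stack(basis)

        # Step 2: population-level LSTD on the current features, i.e. solve
        # Phi^T D (Phi - gamma P Phi) w = Phi^T D r with D = diag(mu).
        A = Phi.T @ (mu[:, None] * (Phi - gamma * (P @ Phi)))
        b = Phi.T @ (mu * r)
        w = np.linalg.solve(A, b)
        V = Phi @ w

    return V, residual_norms


if __name__ == "__main__":
    n, gamma = 8, 0.9
    P = lazy_cycle_chain(n)
    r = np.random.default_rng(0).normal(size=n)
    V, norms = kbb_exact(P, r, gamma, num_rounds=n)
    V_star = np.linalg.solve(np.eye(n) - gamma * P, r)
    print("max abs error:", np.max(np.abs(V - V_star)))
    print("residual norms per round:", norms)
```

Because each new feature is the residual of the previous estimate, the features span a Krylov subspace generated by `r` and `P`, which is the connection the abstract refers to. In this exact setting the loop stops once the residual vanishes, which can happen in fewer rounds than there are states.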