We investigate boosted ensemble models for off-policy learning from logged bandit feedback. Toward this goal, we propose a new boosting algorithm that directly optimizes an estimate of the policy's expected reward. We analyze this algorithm and prove that the empirical risk decreases (possibly exponentially fast) with each round of boosting, provided a "weak" learning condition is satisfied by the base learner. We further show how the base learning problem reduces to standard supervised learning. Experiments indicate that our algorithm can outperform deep off-policy learning and methods that simply regress on the observed rewards, thereby demonstrating the benefits of both boosting and choosing the right learning objective.
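To make the objective concrete, the following is a minimal sketch (not the paper's exact algorithm) of functional gradient boosting on an inverse-propensity-scored (IPS) estimate of a softmax policy's value: each round fits a base regressor to the gradient of the estimated reward, which is the supervised-learning reduction alluded to above. The function names (`boost_policy`, `predict_action`), the choice of depth-limited regression trees as the base learner, and all hyperparameter values are illustrative assumptions.

```python
# Sketch: boosting that ascends the IPS estimate of a softmax policy's value.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def softmax(scores):
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def boost_policy(X, actions, rewards, propensities, n_actions,
                 n_rounds=50, step_size=0.5, max_depth=3):
    """X: (n, d) contexts; actions: logged actions; rewards: observed rewards;
    propensities: logging policy's probability of each logged action."""
    n = X.shape[0]
    F = np.zeros((n, n_actions))      # ensemble scores on the training set
    ensemble = []                     # list of (base learner, step size)
    for _ in range(n_rounds):
        pi = softmax(F)               # current policy probabilities
        pi_logged = pi[np.arange(n), actions]
        # Functional gradient of the IPS value estimate w.r.t. the scores F(x, a):
        # (r / p) * pi(a_logged | x) * (1[a == a_logged] - pi(a | x))
        indicator = np.zeros((n, n_actions))
        indicator[np.arange(n), actions] = 1.0
        grad = (rewards / propensities)[:, None] * pi_logged[:, None] * (indicator - pi)
        # Base learner: multi-output regression onto the gradient (a supervised problem).
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, grad)
        F += step_size * h.predict(X)  # gradient ascent on the estimated reward
        ensemble.append((h, step_size))
    return ensemble

def predict_action(ensemble, X, n_actions):
    F = np.zeros((X.shape[0], n_actions))
    for h, eta in ensemble:
        F += eta * h.predict(X)
    return F.argmax(axis=1)            # act greedily w.r.t. the boosted scores
```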