We propose the first boosting algorithm for off-policy learning from logged bandit feedback. Unlike existing boosting methods for supervised learning, our algorithm directly optimizes an estimate of the policy's expected reward. We analyze this algorithm and prove that the excess empirical risk decreases (possibly exponentially fast) with each round of boosting, provided a "weak" learning condition is satisfied by the base learner. We further show how to reduce the base learner to supervised learning, which opens up a broad range of readily available base learners with practical benefits, such as decision trees. Experiments indicate that our algorithm inherits many desirable properties of tree-based boosting algorithms (e.g., robustness to feature scaling and hyperparameter tuning), and that it can outperform off-policy learning with deep neural networks as well as methods that simply regress on the observed rewards.
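To make the high-level description concrete, the sketch below shows one way such a boosted off-policy learner could look: functional-gradient boosting of a softmax policy's scores against the inverse-propensity-scoring (IPS) estimate of expected reward, with scikit-learn regression trees as base learners. This is an illustrative assumption on our part, not the paper's exact algorithm; the function names, the per-action tree structure, and the fixed step size are all hypothetical choices.

```python
# Minimal sketch (assumed, not the paper's exact method): boost a softmax
# policy's score function by following the functional gradient of the IPS
# estimate of expected reward, using regression trees as base learners.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def softmax(scores):
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def boost_ips(X, actions, rewards, propensities, n_actions,
              n_rounds=50, step_size=0.1, max_depth=3):
    """Each round fits one tree per action to that action's column of the
    gradient of the IPS objective (1/n) * sum_i r_i * pi(a_i|x_i) / mu_i."""
    n = X.shape[0]
    scores = np.zeros((n, n_actions))   # ensemble scores f(x, a)
    ensemble = []                        # list of per-round tree lists
    for _ in range(n_rounds):
        pi = softmax(scores)             # current policy pi(a|x)
        w = rewards / propensities       # per-example IPS weights
        # Gradient of the IPS estimate w.r.t. f(x_i, a):
        #   w_i * pi(a_i|x_i) * (1[a = a_i] - pi(a|x_i))
        logged = w * pi[np.arange(n), actions]
        grad = -pi * logged[:, None]
        grad[np.arange(n), actions] += logged
        trees = []
        for a in range(n_actions):
            tree = DecisionTreeRegressor(max_depth=max_depth)
            tree.fit(X, grad[:, a])                    # approximate ascent direction
            scores[:, a] += step_size * tree.predict(X)
            trees.append(tree)
        ensemble.append(trees)
    return ensemble
```

In this sketch the reduction to supervised learning appears as the `tree.fit(X, grad[:, a])` step: each boosting round only requires a regression fit to the current functional gradient, so any off-the-shelf regressor could stand in for the decision tree.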