We focus on the setting of contextual batched bandits (CBB), where a batch of rewards is observed from the environment in each episode, but the rewards of the non-executed actions remain unobserved (i.e., partial-information feedback). Existing approaches for CBB usually ignore the potential rewards of the non-executed actions, leaving the feedback information underutilized. In this paper, we propose an efficient reward imputation approach using sketching for CBB, which completes the unobserved rewards with imputed rewards that approximate full-information feedback. Specifically, we formulate reward imputation as an imputation-regularized ridge regression problem that captures the feedback mechanisms of both the executed and non-executed actions. To reduce the time complexity of reward imputation, we solve the regression problem using randomized sketching. We prove that our reward imputation approach attains a relative-error bound for the sketching approximation, achieves an instantaneous regret with a controllable bias and a smaller variance than the approach without reward imputation, and enjoys a sublinear regret bound against the optimal policy. Moreover, we present two extensions of our approach, a rate-scheduled version and a version for nonlinear rewards, making our approach more practical. Experimental results demonstrate that our approach outperforms state-of-the-art baselines on synthetic and real-world datasets.
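To make the core idea concrete, the following is a minimal sketch (not the paper's implementation) of sketch-and-solve ridge regression used to impute rewards for non-executed actions. The function name, the Gaussian sketching matrix, and all dimensions are illustrative assumptions.

```python
import numpy as np

def sketched_ridge_imputation(X_exec, r_exec, sketch_dim, lam, rng=None):
    """Approximately solve ridge regression min_w ||X w - r||^2 + lam ||w||^2
    with a random sketch, then return the estimated reward parameter w_hat.

    X_exec : (n, d) contexts of executed actions in the current batch
    r_exec : (n,)   observed rewards of the executed actions
    sketch_dim : number of rows of the sketching matrix (<< n for speedup)
    lam    : ridge regularization strength
    """
    rng = np.random.default_rng(rng)
    n, d = X_exec.shape
    # Gaussian sketching matrix; other sketches (e.g., SRHT, CountSketch) could be used
    S = rng.standard_normal((sketch_dim, n)) / np.sqrt(sketch_dim)
    SX, Sr = S @ X_exec, S @ r_exec
    # Solve the sketched normal equations: (SX^T SX + lam I) w = SX^T Sr
    w_hat = np.linalg.solve(SX.T @ SX + lam * np.eye(d), SX.T @ Sr)
    return w_hat

# Usage (hypothetical data): impute rewards for the non-executed actions
# w_hat = sketched_ridge_imputation(X_exec, r_exec, sketch_dim=64, lam=1.0)
# r_imputed = X_non_exec @ w_hat
```

The sketch reduces the cost of forming the regression from O(n d^2) to roughly O(sketch_dim * d^2) plus the cost of applying the sketch, which is the source of the efficiency gain the abstract refers to.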