Contextual bandit algorithms often estimate reward models to inform decision-making. However, true rewards can contain action-independent redundancies that are not relevant for decision-making. We show it is more data-efficient to estimate any function that explains the reward differences between actions, that is, the treatment effects. Motivated by this observation, building on recent work on oracle-based bandit algorithms, we provide the first reduction of contextual bandits to general-purpose heterogeneous treatment effect estimation, and we design a simple and computationally efficient algorithm based on this reduction. Our theoretical and experimental results demonstrate that heterogeneous treatment effect estimation in contextual bandits offers practical advantages over reward estimation, including more efficient model estimation and greater robustness to model misspecification.
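As a sketch of the intuition (using notation not fixed in the abstract), suppose the mean reward decomposes into an action-independent term and a treatment-effect term,
$$
r(x, a) \;=\; \phi(x) \;+\; \tau(x, a),
$$
where $\phi(x)$ captures context-dependent but action-independent redundancy and $\tau(x, a)$ is the treatment effect of action $a$ relative to some baseline. Since
$$
\arg\max_{a} r(x, a) \;=\; \arg\max_{a} \tau(x, a),
$$
only $\tau$ is needed for decision-making, and $\tau$ can be far simpler to estimate than $r$ when $\phi$ is complex.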