A challenging aspect of the bandit problem is that a stochastic reward is observed only for the chosen arm, while the rewards of the other arms remain missing. Since the arm choice depends on the past context and reward pairs, the contexts of the chosen arms are correlated, which makes the analysis difficult. We propose a novel multi-armed contextual bandit algorithm called Doubly Robust (DR) Thompson Sampling (TS) that applies the DR technique used in the missing data literature to TS. The proposed algorithm improves the regret bound of TS by a factor of $\sqrt{d}$, where $d$ is the dimension of the context. A benefit of the proposed method is that it uses all the context data, whether chosen or not, thus allowing us to circumvent the technical definition of unsaturated arms used in the theoretical analysis of TS. Empirical studies show the advantage of the proposed algorithm over TS.
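As an illustration of the DR technique referenced above, the standard doubly robust pseudo-reward from the missing-data literature imputes the unobserved rewards while remaining unbiased whenever either the selection probability or the imputation model is correct. The notation below is illustrative rather than the paper's exact construction: $x_{t,a}$ denotes the context of arm $a$ at round $t$, $a_t$ the chosen arm, $\pi_{t,a}$ its selection probability, $y_t$ the observed reward, and $x_{t,a}^{\top}\hat{\beta}$ a working linear imputation model.
$$
\tilde{y}_{t,a} \;=\; \left(1-\frac{\mathbb{1}(a_t=a)}{\pi_{t,a}}\right) x_{t,a}^{\top}\hat{\beta} \;+\; \frac{\mathbb{1}(a_t=a)}{\pi_{t,a}}\, y_t .
$$
With pseudo-rewards of this form available for every arm, the regression update can use the contexts of all arms, chosen or not, rather than only the context of the selected arm.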