We study a variant of the multi-armed bandit problem in which side information is provided in the form of bounds on the mean of each arm. We develop the novel non-optimistic Global Under-Explore (GLUE) algorithm, which uses the provided mean bounds (across all the arms) to infer pseudo-variances for each arm, which in turn determine the rate of exploration for each arm. We analyze the regret of GLUE and prove regret upper bounds that are never worse than those of the standard UCB algorithm. Furthermore, we show that GLUE improves upon regret guarantees that exist in the literature for structured bandit problems (both theoretically and empirically). Finally, we study the practical setting of learning adaptive interventions from prior data that has been confounded by unrecorded variables affecting rewards. We show that mean bounds can be inferred naturally from such logs and can thus be used to improve the learning process. We validate our findings through semi-synthetic experiments on data derived from real data sets.
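To make the mechanism concrete, below is a minimal sketch of how per-arm mean bounds could be turned into pseudo-variances that scale a UCB-style exploration width. The function name `glue_style_indices`, the specific pseudo-variance formula, and the toy numbers are illustrative assumptions for exposition, not the exact definitions used by GLUE.

```python
import numpy as np

def glue_style_indices(counts, means, t, lower, upper):
    """UCB-style indices whose exploration width is scaled by a pseudo-variance
    inferred from the mean bounds (illustrative form, not the paper's exact rule)."""
    best_lower = np.max(lower)  # strongest lower bound among all arms
    # Hypothetical pseudo-variance: an arm whose upper bound sits close to (or below)
    # the best lower bound receives a small pseudo-variance and is under-explored;
    # 0.25 is the worst-case variance of a [0, 1]-valued reward.
    pseudo_var = np.minimum(0.25, np.maximum(upper - best_lower, 0.0) ** 2)
    width = np.sqrt(2.0 * pseudo_var * np.log(max(t, 2)) / np.maximum(counts, 1))
    return means + width

# Toy usage: three arms with prior mean bounds; pull the arm with the largest index.
counts = np.array([5, 5, 5])
means = np.array([0.40, 0.55, 0.50])
lower = np.array([0.10, 0.50, 0.00])
upper = np.array([0.45, 0.90, 1.00])
print(glue_style_indices(counts, means, t=15, lower=lower, upper=upper))
```

In this sketch, an arm whose upper bound falls below the best lower bound across arms gets zero exploration width and is effectively eliminated, while unconstrained arms fall back to the standard UCB behavior.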