The contextual bandit problem is a theoretically well-founded framework with applications across many fields. Whereas previous studies of this problem usually require the noise to be independent of the contexts, our work considers a more realistic setting in which the noise is a latent confounder that affects both contexts and rewards. This confounded setting extends to a broader range of applications. However, a confounder that is left unaddressed biases the estimate of the reward function and thus leads to large regret. To deal with the challenges brought by the confounder, we apply dual instrumental variable regression, which can correctly identify the true reward function. We prove that the convergence rate of this method is near-optimal in two widely used types of reproducing kernel Hilbert spaces. Building on these theoretical guarantees, we can design computationally efficient and regret-optimal algorithms for confounded bandit problems. Numerical results illustrate the efficacy of the proposed algorithms in the confounded bandit setting.
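As a minimal illustration of why a latent confounder biases reward estimation and how an instrument can remove that bias, the sketch below uses a scalar linear model with ordinary two-stage least squares. It is a deliberately simplified stand-in, not the dual instrumental variable regression in reproducing kernel Hilbert spaces referred to above; the instrument `z`, the slope `theta`, the noise scales, and all function names are hypothetical choices made for illustration.

```python
import numpy as np


def simulate(n=20000, seed=0):
    """Draw a toy confounded dataset (all constants are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    theta = 2.0                                # true reward slope (hypothetical)
    z = rng.normal(size=n)                     # instrument: influences the context only
    e = rng.normal(size=n)                     # latent noise acting as a confounder
    x = z + e + 0.1 * rng.normal(size=n)       # context depends on instrument and noise
    r = theta * x + e                          # reward shares the same noise term
    return theta, z, x, r


def ols_slope(x, r):
    """Least-squares slope of r on x (no intercept; variables are zero-mean)."""
    return float(x @ r / (x @ x))


def iv_slope(z, x, r):
    """Two-stage least squares with a single instrument z."""
    x_hat = z * (z @ x) / (z @ z)              # stage 1: project the context onto z
    return float(x_hat @ r / (x_hat @ x_hat))  # stage 2: regress the reward on x_hat


if __name__ == "__main__":
    theta, z, x, r = simulate()
    print("true slope:", theta)
    print("naive least squares:", ols_slope(x, r))
    print("instrumental variable:", iv_slope(z, x, r))
```

In this toy model the naive estimate is pulled away from `theta` because the context and the reward share the noise term `e`, while the instrumented estimate uses only the variation in the context that is explained by `z`.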