Multi-armed bandit algorithms minimize experimentation costs required to converge on optimal behavior. They do so by rapidly adapting experimentation effort away from poorly performing actions as feedback is observed. But this desirable feature makes them sensitive to confounding, which is the primary concern underlying classical randomized controlled trials. We highlight, for instance, that popular bandit algorithms cannot address the problem of identifying the best action when day-of-week effects may confound inferences. In response, this paper proposes deconfounded Thompson sampling, which makes simple, but critical, modifications to the way Thompson sampling is usually applied. Theoretical guarantees suggest the algorithm strikes a delicate balance between adaptivity and robustness to confounding. It attains asymptotic lower bounds on the number of samples required to confidently identify the best action -- suggesting optimal adaptivity -- but also satisfies strong performance guarantees in the presence of day-of-week effects and delayed observations -- suggesting unusual robustness. At the core of the paper is a new model of contextual bandit experiments in which issues of delayed learning and distribution shift arise organically.
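To make the confounding concern concrete, the following is a minimal, illustrative sketch (not the paper's deconfounded algorithm) of standard Bernoulli Thompson sampling run against an environment with a hypothetical additive day-of-week effect; the names `true_means` and `day_effect` are assumptions introduced only to show how adaptive allocation can entangle arm comparisons with a time-varying confounder.

```python
# Illustrative sketch only: standard Thompson sampling in a setting with a
# hypothetical day-of-week effect. All quantities below are assumed for the
# example and are not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

n_arms = 3
true_means = np.array([0.50, 0.55, 0.45])       # assumed latent arm qualities
day_effect = np.array([0.0, -0.1, -0.1, -0.1,   # assumed additive shift in
                       -0.1, 0.2, 0.2])          # success probability per weekday

alpha = np.ones(n_arms)   # Beta posterior parameters (successes + 1)
beta = np.ones(n_arms)    # Beta posterior parameters (failures + 1)

horizon = 7 * 100
for t in range(horizon):
    day = t % 7
    # Standard Thompson sampling: draw a mean for each arm from its posterior
    # and play the arm with the largest draw.
    sampled = rng.beta(alpha, beta)
    arm = int(np.argmax(sampled))
    # The observed reward mixes the arm's quality with the day-of-week effect.
    # Because allocation adapts over time, an arm played disproportionately on
    # "good" days can look better than it truly is, confounding the comparison.
    p = np.clip(true_means[arm] + day_effect[day], 0.0, 1.0)
    reward = rng.binomial(1, p)
    alpha[arm] += reward
    beta[arm] += 1 - reward

posterior_means = alpha / (alpha + beta)
print("posterior means:", np.round(posterior_means, 3))
print("arm ranked best by posterior:", int(np.argmax(posterior_means)))
print("truly best arm:", int(np.argmax(true_means)))
```

The sketch only shows the mechanism the abstract warns about; the paper's deconfounded Thompson sampling modifies how Thompson sampling is applied to remain robust to exactly this kind of distortion.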