Multi-armed bandit problems provide a framework to identify the optimal intervention over a sequence of repeated experiments. Without additional assumptions, minimax optimal performance (measured by cumulative regret) is well-understood. With access to additional observed variables that d-separate the intervention from the outcome (i.e., they are a d-separator), recent "causal bandit" algorithms provably incur less regret. However, in practice it is desirable to be agnostic to whether observed variables are a d-separator. Ideally, an algorithm should be adaptive; that is, perform nearly as well as an algorithm with oracle knowledge of the presence or absence of a d-separator. In this work, we formalize and study this notion of adaptivity, and provide a novel algorithm that simultaneously achieves (a) optimal regret when a d-separator is observed, improving on classical minimax algorithms, and (b) significantly smaller regret than recent causal bandit algorithms when the observed variables are not a d-separator. Crucially, our algorithm does not require any oracle knowledge of whether a d-separator is observed. We also generalize this adaptivity to other conditions, such as the front-door criterion.