We explore a new model of bandit experiments in which a potentially nonstationary sequence of contexts influences arms' performance. Context-unaware algorithms risk confounding, while those that perform correct inference face information delays. Our main insight is that an algorithm we call deconfounded Thompson sampling strikes a delicate balance between adaptivity and robustness. Its adaptivity leads to optimal efficiency properties in easy stationary instances, yet it displays surprising resilience in hard nonstationary ones that cause other adaptive algorithms to fail.