In multi-armed bandit (MAB) experiments, it is often advantageous to continuously produce inference on the average treatment effect (ATE) between arms as new data arrive and determine a data-driven stopping time for the experiment. We develop the Mixture Adaptive Design (MAD), a new experimental design for multi-armed bandit experiments that produces powerful and anytime-valid inference on the ATE for \emph{any} bandit algorithm of the experimenter's choice, even those without probabilistic treatment assignment. Intuitively, the MAD "mixes" any bandit algorithm of the experimenter's choice with a Bernoulli design through a tuning parameter $\delta_t$, where $\delta_t$ is a deterministic sequence that decreases the priority placed on the Bernoulli design as the sample size grows. We prove that for $\delta_t = \omega\left(t^{-1/4}\right)$, the MAD generates anytime-valid asymptotic confidence sequences that are guaranteed to shrink around the true ATE. Hence, the experimenter is guaranteed to detect a true non-zero treatment effect in finite time. Additionally, we prove that the regret of the MAD approaches that of its underlying bandit algorithm over time, and hence, incurs a relatively small loss in regret in return for powerful inferential guarantees. Finally, we conduct an extensive simulation study exhibiting that the MAD achieves finite-sample anytime validity and high power without significant losses in finite-sample reward.
翻译:暂无翻译