As bandit algorithms are increasingly used in scientific studies and industrial applications, there is a growing need for reliable inference methods for the adaptively collected data they produce. In this work, we develop methods for inference on data collected in batches using a bandit algorithm. We first prove that the ordinary least squares (OLS) estimator, which is asymptotically normal on independently sampled data, is not asymptotically normal on data collected using standard bandit algorithms when there is no unique optimal arm. This asymptotic non-normality implies that naively treating the OLS estimator as approximately normal can inflate Type-I error and yield confidence intervals with below-nominal coverage probabilities. Second, we introduce the Batched OLS (BOLS) estimator, which we prove is (1) asymptotically normal on data collected from both multi-armed and contextual bandits and (2) robust to non-stationarity in the baseline reward.
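The contrast between pooled OLS and a batch-wise estimator can be illustrated with a small simulation. The abstract does not state the form of the BOLS statistic, so the sketch below rests on assumptions: a two-armed bandit, Gaussian rewards with known noise level `sigma`, an epsilon-greedy policy that updates only between batches, and a statistic that standardizes the difference in arm means within each batch and then sums the per-batch terms scaled by one over the square root of the number of batches. The names `run_batched_bandit`, `ols_statistic`, and `bols_statistic` are hypothetical and exist only for this illustration.

```python
import math
import random


def run_batched_bandit(num_batches, batch_size, arm_means, epsilon, rng):
    """Simulate a two-armed epsilon-greedy bandit that updates only between batches.

    Returns a list of batches; each batch is a list of (arm, reward) pairs.
    Rewards are Gaussian with unit variance (an assumption of this sketch).
    """
    totals = [0.0, 0.0]  # running reward sums per arm
    counts = [0, 0]      # running pull counts per arm
    batches = []
    for _batch_index in range(num_batches):
        # Estimated arm means from all earlier batches (0.0 if never pulled).
        est = [totals[a] / counts[a] if counts[a] else 0.0 for a in (0, 1)]
        if est[1] > est[0]:
            prob_arm1 = 1.0 - epsilon / 2.0
        elif est[1] < est[0]:
            prob_arm1 = epsilon / 2.0
        else:
            prob_arm1 = 0.5  # first batch, or an exact tie
        batch = []
        for _ in range(batch_size):
            arm = 1 if rng.random() < prob_arm1 else 0
            batch.append((arm, rng.gauss(arm_means[arm], 1.0)))
        # Update the running statistics only after the batch is complete.
        for arm, reward in batch:
            totals[arm] += reward
            counts[arm] += 1
        batches.append(batch)
    return batches


def _standardized_difference(pairs, true_delta, sigma):
    """Return sqrt(n0 * n1 / n) * (mean1 - mean0 - true_delta) / sigma.

    Returns 0.0 when either arm is missing, since the difference is undefined.
    """
    rewards = ([], [])
    for arm, reward in pairs:
        rewards[arm].append(reward)
    n0, n1 = len(rewards[0]), len(rewards[1])
    if n0 == 0 or n1 == 0:
        return 0.0
    diff = sum(rewards[1]) / n1 - sum(rewards[0]) / n0
    return math.sqrt(n0 * n1 / (n0 + n1)) * (diff - true_delta) / sigma


def ols_statistic(batches, true_delta, sigma=1.0):
    """Pooled statistic: standardize the difference in means over all data."""
    pooled = [pair for batch in batches for pair in batch]
    return _standardized_difference(pooled, true_delta, sigma)


def bols_statistic(batches, true_delta, sigma=1.0):
    """Batch-wise statistic: standardize within each batch, then combine."""
    terms = [_standardized_difference(b, true_delta, sigma) for b in batches]
    return sum(terms) / math.sqrt(len(batches))


if __name__ == "__main__":
    rng = random.Random(0)
    reps, rejections = 2000, {"ols": 0, "bols": 0}
    for _ in range(reps):
        # Equal arm means: the "no unique optimal arm" setting.
        data = run_batched_bandit(5, 50, (0.0, 0.0), 0.2, rng)
        rejections["ols"] += abs(ols_statistic(data, 0.0)) > 1.96
        rejections["bols"] += abs(bols_statistic(data, 0.0)) > 1.96
    for name, count in rejections.items():
        print(f"{name}: empirical Type-I error = {count / reps:.3f}")
```

Standardizing within each batch conditions on that batch's arm counts, so in this Gaussian setup each per-batch term is standard normal regardless of how earlier batches shaped the sampling probabilities. The pooled statistic has no such guarantee, because its arm counts depend on the rewards being averaged.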