Online experiments such as Randomised Controlled Trials (RCTs) or A/B-tests are the bread and butter of modern platforms on the web. They are conducted continuously to allow platforms to estimate the causal effect of replacing system variant "A" with variant "B" on some metric of interest. These variants can differ in many aspects. In this paper, we focus on the common use-case where they correspond to machine learning models. The online experiment then serves as the final arbiter to decide which model is superior, and should thus be shipped. The statistical literature on causal effect estimation from RCTs has a substantial history, which deservedly contributes to the trust researchers and practitioners place in this "gold standard" of evaluation practices. Nevertheless, in the particular case of machine learning experiments, we remark that certain critical issues remain. Specifically, the assumptions required to ensure that A/B-tests yield unbiased estimates of the causal effect are seldom met in practical applications. We argue that, because variants typically learn from pooled data, an absence of model interference cannot be guaranteed. This undermines the conclusions we can draw from online experiments with machine learning models. We discuss the implications this has for practitioners and for the research literature.