When developing a new networking algorithm, it is established practice to run a randomized experiment, or A/B test, to evaluate its performance. In an A/B test, traffic is randomly allocated between a treatment group, which uses the new algorithm, and a control group, which uses the existing algorithm. However, because networks are congested, treatment and control traffic compete with each other for resources in a way that biases the outcome of these tests. This bias can have a surprisingly large effect; for example, in lab A/B tests with two widely used congestion control algorithms, the treatment appeared to deliver 150% higher throughput when used by a few flows, and 75% lower throughput when used by most flows, despite the fact that the two algorithms have identical throughput when either is used by all traffic. Beyond the lab, we show that A/B tests can also be biased at scale. In an experiment run in cooperation with Netflix, estimates from A/B tests mistake the direction of change of some metrics, miss changes in other metrics, and overestimate the size of effects. We propose alternative experiment designs, previously used in online platforms, to more accurately evaluate new algorithms and allow experimenters to better understand the impact of congestion on their tests.
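As a rough illustration of the bias described above, the toy simulation below (not the paper's experiment or the real algorithms' dynamics) models a single shared bottleneck where a hypothetical "treatment" algorithm takes a larger share of capacity when competing (weight 4 vs. 1) but performs identically to the control when deployed to all traffic. The function name, weights, and capacity are illustrative assumptions; the sketch only reproduces the general phenomenon that a naive A/B comparison can report a large effect even when the true effect of full deployment is zero, not the specific 150%/75% figures from the lab tests.

```python
# Toy model of interference bias in a congestion A/B test.
# Assumption: per-flow throughput at a shared bottleneck is split in
# proportion to each flow's "aggressiveness" weight (hypothetical model).

def per_flow_throughput(n_treat, n_ctrl, capacity=100.0, w_treat=4.0, w_ctrl=1.0):
    """Return (treatment, control) per-flow throughput at the bottleneck."""
    total_weight = n_treat * w_treat + n_ctrl * w_ctrl
    treat = capacity * w_treat / total_weight if n_treat else None
    ctrl = capacity * w_ctrl / total_weight if n_ctrl else None
    return treat, ctrl

N = 100

# Ground truth: deploying either algorithm to *all* flows yields the same
# per-flow throughput, so the true effect of the change is 0%.
all_treat, _ = per_flow_throughput(N, 0)
_, all_ctrl = per_flow_throughput(0, N)
print(f"full treatment: {all_treat:.2f}, full control: {all_ctrl:.2f}")

# A/B test: allocate 10% of flows to treatment. Because both groups share
# the bottleneck, treatment flows take bandwidth from control flows, and the
# naive comparison reports a gain that would never materialize at scale.
treat, ctrl = per_flow_throughput(10, 90)
print(f"A/B estimate: treatment {treat:.2f} vs control {ctrl:.2f} "
      f"({(treat / ctrl - 1) * 100:.0f}% apparent gain)")
```

Under these assumptions the script prints identical throughput (1.00) for full deployment of either algorithm, yet the within-experiment comparison reports roughly a 300% apparent gain, showing how competition between the groups, rather than the algorithm itself, can drive the measured difference.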