Online controlled experiments, or A/B tests, are large-scale randomized trials in digital environments. This paper investigates the estimands of the difference-in-means estimator in these experiments, focusing on scenarios with repeated measurements on users. We compare cumulative metrics that use all post-exposure data for each user to windowed metrics that measure each user over a fixed time window. We analyze the estimands and highlight trade-offs between the two types of metrics. Our findings reveal that while cumulative metrics eliminate the need for pre-defined measurement windows, they result in estimands that are more intricately tied to the experiment intake and runtime. This complexity can lead to counter-intuitive practical consequences, such as decreased statistical power with more observations. However, cumulative metrics offer earlier results and can quickly detect strong initial signals. We conclude that neither metric type is universally superior. The optimal choice depends on the temporal profile of the treatment effect, the distribution of exposure, and the stopping time of the experiment. This research provides insights for experimenters to make informed decisions about how to define metrics based on their specific experimental contexts and objectives.
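The contrast between the two metric types can be illustrated with a small simulation. The sketch below is hypothetical and not from the paper: it assumes a 28-day runtime, a 7-day measurement window, uniform staggered intake, and a treatment effect that acts only during a user's first 7 days post-exposure. Under these assumptions, the windowed difference-in-means recovers the full 7-day effect, while the cumulative estimand is diluted by users who have not yet accrued the full effect at the stopping time.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users = 10_000
runtime = 28   # experiment runtime in days (assumed)
window = 7     # fixed per-user measurement window in days (assumed)

# Staggered intake: users enter uniformly over the runtime,
# so post-exposure observation lengths vary from 1 to 28 days.
entry = rng.integers(0, runtime, n_users)
arm = rng.integers(0, 2, n_users)   # 0 = control, 1 = treatment
observed = runtime - entry          # days observed, truncated at stopping time

# Hypothetical outcome: baseline 1.0/day, plus a treatment lift of
# 0.2/day that lasts only the first 7 days after exposure.
def total_outcome(days, treated):
    base = 1.0 * days
    lift = 0.2 * min(days, window) if treated else 0.0
    return base + lift

# Cumulative metric: all post-exposure data for every user.
cumulative = np.array([total_outcome(d, a) for d, a in zip(observed, arm)])

# Windowed metric: only users with a complete window, measured over it.
complete = observed >= window
windowed = np.array([total_outcome(window, a) for a in arm[complete]])

def diff_in_means(y, a):
    return y[a == 1].mean() - y[a == 0].mean()

cum_est = diff_in_means(cumulative, arm)
win_est = diff_in_means(windowed, arm[complete])
print(f"cumulative estimate: {cum_est:.3f}")  # entangled with intake/runtime
print(f"windowed estimate:   {win_est:.3f}")  # exactly 0.2 * 7 = 1.4 here
```

Because the outcome is noise-free in this toy setup, the windowed estimate equals the 7-day effect exactly, whereas the cumulative estimate mixes the effect with the distribution of observation lengths, demonstrating how the cumulative estimand depends on intake and stopping time.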