Online platforms regularly conduct randomized experiments to understand how changes to the platform causally affect various outcomes of interest. However, experimentation on online platforms has been criticized for, among other issues, a lack of meaningful oversight and user consent. As platforms give users greater agency, it becomes possible to conduct observational studies in which users self-select into the treatment of interest, as an alternative to experiments in which the platform controls whether the user receives treatment. In this paper, we conduct four large-scale within-study comparisons on Twitter aimed at assessing the effectiveness of observational studies derived from user self-selection on online platforms. In a within-study comparison, treatment effects from an observational study are assessed by how well they replicate results from a randomized experiment with the same target population. We test the naive difference-in-group-means estimator, exact matching, regression adjustment, and inverse probability of treatment weighting, in each case controlling for plausible confounding variables. In all four comparisons, the observational estimates perform poorly at recovering the ground-truth estimates from the analogous randomized experiments, and in all cases except one they have the opposite sign of the randomized estimate. Our results suggest that observational studies derived from user self-selection are a poor alternative to randomized experimentation on online platforms. In discussing our results, we postulate "Catch-22"s suggesting that the success of causal inference in these settings may be at odds with the original motivations for providing users with greater agency.
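To make three of the tested estimators concrete, the following is a minimal sketch in Python on synthetic data. The variable names, the data-generating process, and the use of NumPy and scikit-learn are illustrative assumptions for exposition only, not the paper's actual Twitter data or analysis pipeline; exact matching is omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data (illustrative only): X are observed covariates
# (the "plausible confounders"), T is the self-selected treatment, and Y
# is the outcome. The true treatment effect is set to 2.0.
n = 10_000
X = rng.normal(size=(n, 3))
p_treat = 1.0 / (1.0 + np.exp(-X @ np.array([0.8, -0.5, 0.3])))
T = rng.binomial(1, p_treat)
Y = 2.0 * T + X @ np.array([1.0, 1.0, -1.0]) + rng.normal(size=n)

# (1) Naive difference in group means: compares treated and untreated
# outcomes directly, ignoring confounding from self-selection.
naive = Y[T == 1].mean() - Y[T == 0].mean()

# (2) Regression adjustment: regress Y on T and X jointly and read off
# the coefficient on T.
ra = LinearRegression().fit(np.column_stack([T, X]), Y).coef_[0]

# (3) Inverse probability of treatment weighting (IPTW): estimate each
# unit's propensity score, then weight outcomes by the inverse of the
# probability of the treatment actually received (Hajek-style ratios).
ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
iptw = np.average(Y, weights=T / ps) - np.average(Y, weights=(1 - T) / (1 - ps))

print(f"naive: {naive:.2f}  regression: {ra:.2f}  IPTW: {iptw:.2f}  (truth: 2.0)")
```

In this synthetic setup the adjusted estimators recover the true effect because every confounder is observed and the models are correctly specified; the paper's within-study comparisons probe precisely the setting where those assumptions may fail on real platform data.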