Online platforms regularly conduct randomized experiments to understand how changes to the platform causally affect various outcomes of interest. However, experimentation on online platforms has been criticized for having, among other issues, a lack of meaningful oversight and user consent. As platforms give users greater agency, it becomes possible to conduct observational studies in which users self-select into the treatment of interest as an alternative to experiments in which the platform controls whether the user receives treatment or not. In this paper, we conduct four large-scale within-study comparisons on Twitter aimed at assessing the effectiveness of observational studies derived from user self-selection on online platforms. In a within-study comparison, treatment effects from an observational study are assessed based on how effectively they replicate results from a randomized experiment with the same target population. We test the naive difference in group means estimator, exact matching, regression adjustment, and inverse probability of treatment weighting while controlling for plausible confounding variables. In all cases, all observational estimates perform poorly at recovering the ground-truth estimate from the analogous randomized experiments. In all cases except one, the observational estimates have the opposite sign of the randomized estimate. Our results suggest that observational studies derived from user self-selection are a poor alternative to randomized experimentation on online platforms. In discussing our results, we postulate a "Catch-22" that suggests that the success of causal inference in these settings may be at odds with the original motivations for providing users with greater agency.
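To make the four estimators and the within-study comparison concrete, the following is a minimal sketch on purely synthetic data, not the paper's code or its Twitter data. Everything in it is an assumption made for illustration: the data-generating process, the single observed binary covariate `x`, the hidden trait `u` that drives both opt-in and the outcome, the value of `TRUE_EFFECT`, and all function names.

```python
import random
from statistics import mean

TRUE_EFFECT = 1.0  # assumed ground truth for the synthetic data only


def simulate(n, randomized, seed=0):
    """Return a list of (t, x, y) rows; the hidden trait u is deliberately dropped."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = int(rng.random() < 0.5)  # observed binary covariate
        u = int(rng.random() < 0.5)  # hidden trait the analyst never sees
        # Randomized: coin flip. Observational: opt-in depends on x and u.
        p_treat = 0.5 if randomized else 0.1 + 0.2 * x + 0.6 * u
        t = int(rng.random() < p_treat)
        y = TRUE_EFFECT * t + 1.0 * x - 3.0 * u + rng.gauss(0.0, 1.0)
        rows.append((t, x, y))
    return rows


def naive_difference(rows):
    """Difference in group means with no adjustment."""
    treated = [y for t, _, y in rows if t == 1]
    control = [y for t, _, y in rows if t == 0]
    return mean(treated) - mean(control)


def exact_matching(rows):
    """Compare treated and control within each level of x, weighted by stratum size."""
    estimate = 0.0
    for level in (0, 1):
        stratum = [r for r in rows if r[1] == level]
        estimate += len(stratum) / len(rows) * naive_difference(stratum)
    return estimate


def regression_adjustment(rows):
    """OLS coefficient on t in y ~ 1 + t + x, computed by partialling x out of t."""
    t_mean = {lvl: mean(t for t, x, _ in rows if x == lvl) for lvl in (0, 1)}
    resid = [t - t_mean[x] for t, x, _ in rows]
    num = sum(r * y for r, (_, _, y) in zip(resid, rows))
    den = sum(r * r for r in resid)
    return num / den


def iptw(rows):
    """Inverse probability of treatment weighting; propensity estimated per level of x."""
    e = {lvl: mean(t for t, x, _ in rows if x == lvl) for lvl in (0, 1)}
    w1 = [(1 / e[x], y) for t, x, y in rows if t == 1]
    w0 = [(1 / (1 - e[x]), y) for t, x, y in rows if t == 0]
    mean1 = sum(w * y for w, y in w1) / sum(w for w, _ in w1)
    mean0 = sum(w * y for w, y in w0) / sum(w for w, _ in w0)
    return mean1 - mean0


if __name__ == "__main__":
    # Within-study comparison: same population, randomized vs. self-selected treatment.
    experiment = simulate(20_000, randomized=True, seed=1)
    observational = simulate(20_000, randomized=False, seed=2)
    print("randomized benchmark:", round(naive_difference(experiment), 2))
    for estimator in (naive_difference, exact_matching, regression_adjustment, iptw):
        print(estimator.__name__, round(estimator(observational), 2))
```

In this toy setup the randomized difference in means lands near `TRUE_EFFECT`, while all four observational estimates are pulled below zero because opt-in depends on the unmeasured `u`. With a single binary covariate, exact matching and IPTW coincide, so the sketch shows only the mechanics of each estimator and says nothing about the magnitude or cause of the discrepancies reported in the paper.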