As modern software and online platforms collect massive amounts of data, there is increasing demand for applying causal inference methods at large scale when randomized experimentation is not viable. Weighting methods that directly incorporate covariate balancing have recently gained popularity for estimating causal effects in observational studies. These methods reduce the manual effort researchers otherwise spend iterating between propensity score modeling and balance checking until covariate balance is satisfactory. However, conventional solvers for determining the weights do not scale to the large datasets found at companies like Snap Inc. To address this limitation and improve computational efficiency, we present scalable algorithms, DistEB and DistMS, for two balancing approaches: entropy balancing and MicroSynth. The solvers have linear time complexity and can be conveniently implemented in distributed computing frameworks such as Spark and Hive. We study the properties of the balancing approaches at different scales, up to 1 million treated units and 487 covariates. We find that with larger sample sizes, both the bias and the variance of the causal effect estimates are significantly reduced. These results emphasize the importance of applying balancing approaches to large-scale datasets. We combine the balancing approach with a synthetic control framework and deploy an end-to-end system for causal impact estimation at Snap Inc.
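For readers unfamiliar with entropy balancing, the following is a minimal single-machine sketch of the standard formulation, not the DistEB algorithm itself. It assumes the usual setup in which control units are reweighted so that their weighted covariate means match a target (typically the treated-group means), and it solves the dual problem with a damped Newton iteration. The function name `entropy_balance`, its parameters, and the synthetic data are illustrative assumptions. Each iteration only needs sums over units, which is why this kind of solver can in principle be expressed as aggregations in a distributed framework.

```python
import numpy as np

def entropy_balance(X_control, target_means, max_iter=100, tol=1e-8):
    """Illustrative entropy balancing solver (hypothetical helper, not DistEB).

    Returns weights w (positive, summing to 1) over control units such that
    X_control.T @ w is approximately equal to target_means.
    """
    # Center covariates at the target so the balance condition is Z.T @ w = 0.
    Z = np.asarray(X_control, dtype=float) - np.asarray(target_means, dtype=float)
    n, d = Z.shape

    def dual_and_weights(lam):
        # Dual objective log(sum_i exp(z_i . lam)), computed stably,
        # together with the implied softmax weights.
        s = Z @ lam
        m = s.max()
        e = np.exp(s - m)
        total = e.sum()
        return m + np.log(total), e / total

    lam = np.zeros(d)
    f, w = dual_and_weights(lam)
    for _ in range(max_iter):
        grad = Z.T @ w  # weighted covariate imbalance
        if np.max(np.abs(grad)) < tol:
            break
        # Hessian of the dual is the weighted covariance of Z.
        hess = (Z * w[:, None]).T @ Z - np.outer(grad, grad)
        step = np.linalg.solve(hess + 1e-10 * np.eye(d), grad)
        # Backtracking line search keeps the Newton step from overshooting.
        t = 1.0
        f_new, w_new = dual_and_weights(lam - t * step)
        while f_new > f - 1e-4 * t * (grad @ step) and t > 1e-12:
            t *= 0.5
            f_new, w_new = dual_and_weights(lam - t * step)
        lam, f, w = lam - t * step, f_new, w_new
    return w

# Synthetic example: 500 control units, 3 covariates, made-up target means.
rng = np.random.default_rng(0)
X_control = rng.normal(size=(500, 3))
target = np.array([0.3, -0.2, 0.1])
w = entropy_balance(X_control, target)
```

Per iteration the cost is linear in the number of control units for a fixed number of covariates, since the gradient and Hessian are plain weighted sums over rows.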