While many areas of machine learning have benefited from the increasing availability of large and varied datasets, the benefit to causal inference has been limited given the strong assumptions needed to ensure identifiability of causal effects; these are often not satisfied in real-world datasets. For example, many large observational datasets (e.g., case-control studies in epidemiology, click-through data in recommender systems) suffer from selection bias on the outcome, which makes the average treatment effect (ATE) unidentifiable. We propose a general algorithm to estimate causal effects from \emph{multiple} data sources, where the ATE may be identifiable only in some datasets but not others. The key idea is to construct control variates using the datasets in which the ATE is not identifiable. We show theoretically that this reduces the variance of the ATE estimate. We apply this framework to inference from observational data under outcome selection bias, assuming access to an auxiliary small dataset from which we can obtain a consistent estimate of the ATE. We construct a control variate by taking the difference of the odds ratio estimates from the two datasets. Across simulations and two case studies with real data, we show that this control variate can significantly reduce the variance of the ATE estimate.
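As a rough illustration of the control-variate idea in the abstract, the following minimal sketch simulates a small unselected dataset and a large outcome-selected dataset, then adjusts the small-sample ATE estimate using the difference of the two odds ratio estimates. It is not the authors' implementation. It assumes a binary, randomized treatment, a binary outcome, and selection that depends only on the outcome (so the odds ratio is preserved in the selected data). All function names, the probabilities, the sample sizes, and the 0.5 continuity correction are hypothetical choices made for this example. The coefficient `beta` is estimated across Monte Carlo replicates purely for demonstration; with a single pair of datasets it would have to be estimated some other way, for example by bootstrapping.

```python
import numpy as np


def simulate_counts(rng, n, p1, p0, keep_case, keep_noncase, reps):
    """Simulate 2x2 tables (a, b, c, d) for `reps` datasets of n units each.

    a: treated & Y=1, b: treated & Y=0, c: control & Y=1, d: control & Y=0.
    Units are retained with a probability that depends only on the outcome Y.
    """
    n1 = rng.binomial(n, 0.5, size=reps)      # randomized treatment
    n0 = n - n1
    y1 = rng.binomial(n1, p1)                 # outcomes among treated
    y0 = rng.binomial(n0, p0)                 # outcomes among controls
    a = rng.binomial(y1, keep_case)
    b = rng.binomial(n1 - y1, keep_noncase)
    c = rng.binomial(y0, keep_case)
    d = rng.binomial(n0 - y0, keep_noncase)
    return a, b, c, d


def risk_difference(a, b, c, d):
    """Difference-in-means ATE estimate; valid only without selection."""
    return a / (a + b) - c / (c + d)


def odds_ratio(a, b, c, d):
    """Odds ratio with a 0.5 continuity correction to avoid empty cells."""
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))


rng = np.random.default_rng(0)
reps = 5000
p1, p0 = 0.4, 0.2                             # hypothetical P(Y=1 | T=1), P(Y=1 | T=0)
true_ate = p1 - p0

# Small dataset without selection: the ATE is identifiable here.
small = simulate_counts(rng, 500, p1, p0, 1.0, 1.0, reps)
# Large dataset selected on the outcome: the ATE is not identifiable,
# but the odds ratio still is.
large = simulate_counts(rng, 20000, p1, p0, 1.0, 0.2, reps)

tau_hat = risk_difference(*small)
# Control variate: difference of odds ratio estimates, approximately mean zero.
cv = odds_ratio(*small) - odds_ratio(*large)

# Variance-minimizing coefficient Cov(tau_hat, cv) / Var(cv).
beta = np.cov(tau_hat, cv)[0, 1] / np.var(cv, ddof=1)
tau_cv = tau_hat - beta * cv

print("variance of raw estimate:     ", tau_hat.var())
print("variance of adjusted estimate:", tau_cv.var())
print("mean of adjusted estimate:    ", tau_cv.mean(), "(true ATE:", true_ate, ")")
```

The adjustment helps because the sampling error in the small-sample odds ratio is strongly correlated with the error in the small-sample ATE estimate, while the large dataset pins down the odds ratio much more precisely. In finite samples the odds ratio estimate is slightly biased, so the control variate is only approximately mean zero.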