In the era of big data, the rapid growth of multi-source heterogeneous data presents both challenges and opportunities for improving inference on conditional average treatment effects. In this paper, we investigate homogeneous and heterogeneous causal data fusion problems under a general setting that allows for source-specific covariates. We provide a direct learning framework for integrating multi-source data that separates the treatment effect from the other nuisance functions and achieves double robustness against certain misspecification. To improve estimation precision and stability, we propose a causal information-aware weighting function motivated by semiparametric efficiency theory; it is highly interpretable and assigns larger weights to samples containing more causal information. We introduce a two-step algorithm, the weighted multi-source direct learner, which constructs a pseudo-outcome and regresses it on covariates under a weighted least squares criterion; it is a practical tool for causal data fusion that is easy to implement, doubly robust, and flexible in the choice of models. In simulation studies, we demonstrate the effectiveness of the proposed methods in both homogeneous and heterogeneous causal data fusion scenarios.
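The two-step idea described above (build a pseudo-outcome, then regress it on covariates by weighted least squares) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the simulated data, the known per-source propensities, the linear outcome models, the standard augmented inverse-probability-weighted (AIPW) pseudo-outcome, and the weights e(1 - e), which are inverse-variance weights for that pseudo-outcome under homoskedastic noise. The paper's own pseudo-outcome and causal information-aware weighting function may differ, and source-specific covariates are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)


def wls(X, y, w):
    """Weighted least squares: minimise sum_i w_i * (y_i - X_i @ beta)**2."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta


# Hypothetical data: two sources sharing one covariate x, with different
# known treatment probabilities. True effect is tau(x) = 1 + 2 * x.
n_per_source = 2000
propensities = (0.5, 0.2)                     # assumed known, one per source
e = np.repeat(propensities, n_per_source)     # propensity for each sample
x = rng.uniform(-1.0, 1.0, size=e.size)
a = rng.binomial(1, e)                        # treatment indicator
y = 0.5 * x + a * (1.0 + 2.0 * x) + rng.normal(0.0, 0.5, size=e.size)
X = np.column_stack([np.ones_like(x), x])     # design matrix [1, x]

# Step 1: pseudo-outcome. Outcome models are fit on pooled data per arm,
# then combined in the AIPW form (a stand-in for the paper's construction).
treated = a == 1
mu1 = X @ wls(X[treated], y[treated], np.ones(treated.sum()))
mu0 = X @ wls(X[~treated], y[~treated], np.ones((~treated).sum()))
phi = mu1 - mu0 + a * (y - mu1) / e - (1 - a) * (y - mu0) / (1 - e)

# Step 2: regress the pseudo-outcome on covariates with sample weights.
# e * (1 - e) down-weights samples from the less balanced source.
w = e * (1.0 - e)
beta_hat = wls(X, phi, w)
print("estimated (intercept, slope) of tau(x):", beta_hat)
```

With this simulated design the printed coefficients should land near the true values (1, 2). Replacing `w` with a vector of ones gives the unweighted version of the same regression, which is a simple way to see the effect of the weighting on precision.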