Covariate selection is crucial when estimating average treatment effects from observational data with high-dimensional or even ultra-high-dimensional pretreatment variables. Existing methods typically assume sparse linear models for both the outcome and a univariate treatment, and cannot handle ultra-high-dimensional covariates. In this paper, we propose a new covariate selection strategy, double screening prior adaptive lasso (DSPAL), which selects confounders and predictors of the outcome for multivariate treatments. DSPAL combines the adaptive lasso with prior information on marginal conditional (in)dependence to select the target covariates, in order to eliminate confounding bias and improve statistical efficiency. Its distinctive features are that it applies to high-dimensional or even ultra-high-dimensional covariates with multivariate treatments, and that it accommodates both parametric and nonparametric outcome models, which makes it more robust than other methods. Our theoretical analysis shows that the proposed procedure enjoys the sure screening property, ranking consistency, and variable selection consistency. In a simulation study, we demonstrate that the proposed approach consistently selects all confounders and outcome predictors and estimates the multivariate treatment effects with smaller bias and mean squared error than several alternatives under various scenarios. In a real data analysis, we apply the method to estimate the causal effect of a three-dimensional continuous environmental treatment on cholesterol level and obtain enlightening results.
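To make the screen-then-penalize idea summarized in the abstract concrete, the following is a minimal sketch and not the DSPAL procedure itself. It assumes a linear outcome model, uses plain marginal correlation as a stand-in for the paper's conditional (in)dependence measure, and leaves the treatment coefficients unpenalized. All function names, parameters, and the simulated data are hypothetical choices made for illustration.

```python
import numpy as np

def marginal_screen(X, y, keep):
    """Keep the `keep` covariates with the largest absolute marginal correlation with y."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    score = np.abs(Xs.T @ ys) / len(y)
    return np.sort(np.argsort(score)[::-1][:keep])

def weighted_lasso(Z, y, penalty, lam, n_iter=500, tol=1e-8):
    """Coordinate descent for (1/2n)||y - Z b||^2 + lam * sum_j penalty[j] * |b[j]|."""
    n, p = Z.shape
    b = np.zeros(p)
    col_sq = (Z ** 2).sum(axis=0) / n
    resid = y.copy()  # equals y - Z @ b while b is all zeros
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            old = b[j]
            rho = Z[:, j] @ resid / n + col_sq[j] * old
            # Soft-thresholding update for coordinate j
            new = np.sign(rho) * max(abs(rho) - lam * penalty[j], 0.0) / col_sq[j]
            if new != old:
                resid -= Z[:, j] * (new - old)
                b[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return b

def screen_then_adaptive_lasso(X, A, y, keep, lam, gamma=1.0):
    """Illustrative two-stage selection (hypothetical helper, not the authors' algorithm)."""
    n, k = A.shape
    # Stage 1: remove the linear treatment signal from y, then screen the covariates.
    A1 = np.column_stack([np.ones(n), A])
    y_res = y - A1 @ np.linalg.lstsq(A1, y, rcond=None)[0]
    kept = marginal_screen(X, y_res, keep)
    # Stage 2: adaptive lasso on treatments plus the screened covariates.
    Z = np.column_stack([A, X[:, kept]])
    Z = Z - Z.mean(axis=0)
    yc = y - y.mean()
    q = Z.shape[1]
    # Lightly ridge-regularized initial fit supplies the adaptive weights.
    init = np.linalg.solve(Z.T @ Z / n + 1e-3 * np.eye(q), Z.T @ yc / n)
    # Treatments carry zero penalty; covariates get weight 1 / |init|^gamma.
    penalty = np.concatenate([np.zeros(k), 1.0 / (np.abs(init[k:]) ** gamma + 1e-12)])
    coef = weighted_lasso(Z, yc, penalty, lam)
    selected = kept[np.abs(coef[k:]) > 0]
    return selected, coef[:k]

# Simulated example (assumed design): X0 and X1 are confounders, X2 affects
# only the treatments, X3 affects only the outcome.
rng = np.random.default_rng(0)
n, p = 400, 1000
X = rng.standard_normal((n, p))
A = 0.5 * X[:, [0, 1]] + 0.5 * X[:, [2]] + rng.standard_normal((n, 2))
y = A @ np.array([1.0, -1.0]) + 2.0 * X[:, [0, 1, 3]].sum(axis=1) + rng.standard_normal(n)

selected, tau_hat = screen_then_adaptive_lasso(X, A, y, keep=30, lam=0.05)
print("selected covariates:", selected)
print("estimated treatment coefficients:", tau_hat)
```

The adaptive weights are what separate the kept covariates from each other in the second stage: a covariate with a small initial coefficient receives a large penalty and tends to be dropped, while strong outcome predictors are shrunk only slightly. The screening size `keep` and the penalty level `lam` are tuning choices assumed here, not values taken from the paper.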