Prior work has shown that combining bootstrap imputation with tree-based machine learning variable selection methods can recover the good performance achievable on fully observed data when covariate and outcome data are missing at random (MAR). This approach, however, is computationally expensive, especially on large-scale datasets. We propose an inference-based method, RR-BART, that leverages the likelihood-based Bayesian machine learning technique Bayesian Additive Regression Trees (BART) and uses Rubin's rule to combine the estimates and variances of the variable importance measures across multiply imputed datasets for variable selection in the presence of missing data. A representative simulation study suggests that RR-BART performs at least as well as BI-BART, which combines bootstrap imputation with BART, while offering substantial computational savings, even in complex conditions of nonlinearity and nonadditivity with a large percentage of overall missingness under the MAR mechanism. RR-BART is also less sensitive than BI-BART to the end-node prior, which is controlled by the hyperparameter $k$, and does not depend on the selection threshold value $\pi$ that BI-BART requires. Our simulation studies also suggest that encoding the missing values of a binary predictor as a separate category significantly improves the power of BI-BART to select that predictor. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome with data from the Study of Women's Health Across the Nation.
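The pooling step named above, Rubin's rule applied to per-imputation estimates and variances, can be sketched as follows. This is a minimal illustration of the standard rule, not the authors' implementation: the function name `pool_rubin` and the numeric inputs are hypothetical, and how RR-BART obtains the variable importance estimates and turns the pooled quantities into a selection decision is not specified here.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool M per-imputation estimates and their variances with Rubin's rule.

    Returns the pooled estimate and the total variance
    T = W + (1 + 1/M) * B, where W is the mean within-imputation
    variance and B is the between-imputation variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # W: average within-imputation variance
    between = variance(estimates)   # B: sample variance, divisor M - 1
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical importance estimates for one predictor across M = 5
# imputed datasets, with hypothetical per-dataset variances.
vip_estimates = [0.10, 0.12, 0.08, 0.11, 0.09]
vip_variances = [0.0004, 0.0005, 0.0004, 0.0006, 0.0005]

pooled, total_var = pool_rubin(vip_estimates, vip_variances)
print(f"pooled estimate = {pooled:.4f}, total variance = {total_var:.6f}")
```

The total variance exceeds the average within-imputation variance whenever the estimates differ across imputed datasets, which is how the extra uncertainty due to missing data enters the inference.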