Missing data are ubiquitous in health studies. Variable selection in the presence of both missing covariates and missing outcomes is an important statistical research topic, but it has received relatively little study. The existing literature focuses on parametric regression techniques that provide direct parameter estimates of the regression model. Flexible nonparametric machine learning methods considerably reduce the reliance on parametric assumptions, but they do not provide a variable importance measure as naturally defined as the covariate effect native to parametric models. We investigate a general variable selection approach for settings in which both the covariates and the outcomes can be missing at random and have general missing data patterns. This approach exploits the flexibility of machine learning modeling techniques and bootstrap imputation, which is amenable to nonparametric methods in which the covariate effects are not directly available. We conduct extensive simulations investigating the practical operating characteristics of the proposed variable selection approach when combined with four tree-based machine learning methods: XGBoost, Random Forests, Bayesian Additive Regression Trees (BART), and Conditional Random Forests, as well as two commonly used parametric methods, lasso and backward stepwise selection. Numerical results suggest that, when combined with bootstrap imputation, XGBoost and BART have the best overall variable selection performance with respect to the $F_1$ score and Type I error across various settings. In general, the choice of imputation method makes no significant difference to variable selection performance. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome, using data from the Study of Women's Health Across the Nation.
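For readers unfamiliar with how variable selection is scored in a simulation, the sketch below shows one common way to compute an $F_1$ score and a Type I error from a selected set of predictors. It is an illustration under stated assumptions, not the authors' code. The function name `selection_metrics`, the predictor names, and the definition of Type I error as the fraction of noise predictors wrongly selected are all assumptions introduced here.

```python
def selection_metrics(selected, truly_useful, all_predictors):
    """Score a variable selection result against a known truth.

    Assumed definitions (not taken from the abstract):
      precision = selected useful predictors / all selected predictors
      recall    = selected useful predictors / all useful predictors
      F1        = harmonic mean of precision and recall
      Type I    = selected noise predictors / all noise predictors
    """
    all_predictors = set(all_predictors)
    truly_useful = set(truly_useful) & all_predictors
    selected = set(selected) & all_predictors
    noise = all_predictors - truly_useful

    true_pos = len(selected & truly_useful)
    false_pos = len(selected & noise)

    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / len(truly_useful) if truly_useful else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    type_i = false_pos / len(noise) if noise else 0.0

    return {"precision": precision, "recall": recall,
            "f1": f1, "type_i_error": type_i}


# Hypothetical simulation setting: 10 candidate predictors, 4 truly useful.
predictors = [f"x{i}" for i in range(1, 11)]
useful = ["x1", "x2", "x3", "x4"]
# Hypothetical selection result: three useful predictors and one noise predictor.
example = selection_metrics(["x1", "x2", "x3", "x7"], useful, predictors)
```

In this hypothetical setting, three of the four selected predictors are useful and one of the six noise predictors is selected, so precision and recall are both 0.75 and the Type I error is 1/6.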