Standard approaches for variable selection in linear models are not tailored to deal properly with high-dimensional and incomplete data. Currently, methods dedicated to high-dimensional data handle missing values by ad hoc strategies, such as complete-case analysis or single imputation, while methods dedicated to missing values, mainly based on multiple imputation, do not discuss which imputation method to use with high-dimensional data. Consequently, both approaches appear limited for many modern applications. Drawing inspiration from ensemble methods, a new variable selection method is proposed. It extends classical variable selection methods to high-dimensional data, with or without missing values. Theoretical properties are studied, and the practical interest is demonstrated through a simulation study as well as an application to model specification in sequential multiple imputation. In the low-dimensional case, the procedure improves control of the error risks, especially the type I error, for stepwise, lasso, or knockoff methods, even without missing values. With missing values, the method performs better than reference selection methods based on multiple imputation. Similar performance is obtained in the high-dimensional case, with or without missing values.