Variable selection is crucial for sparse modeling in the age of big data. Missing values are common in practice and complicate variable selection. Multiple imputation (MI) fills in the missing values several times, yielding multiply imputed datasets, and has been widely combined with variable selection procedures. However, performing variable selection directly on the full MI data or on bootstrapped MI data may not be worth the computational cost. To quickly identify the active variables in a linear regression model, we propose an adaptive grafting procedure with three pooling rules for MI data. The proposed methods proceed iteratively: they start by finding active variables on the complete-case subset and then expand the working data matrix in both the number of active variables and the number of available observations. A comprehensive simulation study demonstrates the selection accuracy, in several respects, and the computational efficiency of the proposed methods. Two real-life examples illustrate their strengths.
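
The abstract gives no algorithmic detail, so the following is only a rough sketch of the "start on complete cases, then expand" idea, not the authors' procedure. It assumes a fully observed response, predictors on a comparable scale, a greedy gradient-based screening step in the spirit of grafting, and plain least-squares refits. It leaves out imputation and the three pooling rules entirely. The names `rows_complete_in`, `grafting_sketch`, and the threshold `lam` are hypothetical.

```python
import numpy as np

def rows_complete_in(X, cols):
    """Boolean mask of rows with no NaN in the given columns."""
    return ~np.isnan(X[:, cols]).any(axis=1)

def grafting_sketch(X, y, lam, max_steps=None):
    """Hypothetical helper: gradient screening on complete cases,
    least-squares refits on an expanding set of rows.

    Assumes y is fully observed and lam >= 0.
    """
    n, p = X.shape
    cc = rows_complete_in(X, list(range(p)))  # complete-case subset
    if not cc.any():
        raise ValueError("no complete cases to start from")
    active, beta = [], np.zeros(0)
    for _ in range(max_steps or p):
        # Screen on complete cases: magnitude of the gradient of the
        # mean squared loss with respect to each inactive coefficient.
        resid = y[cc] - X[cc][:, active] @ beta
        grad = X[cc].T @ resid / cc.sum()
        grad[active] = 0.0
        j = int(np.argmax(np.abs(grad)))
        if abs(grad[j]) <= lam:
            break  # no inactive variable passes the threshold
        active.append(j)
        # Expand the working data: refit on every row that is observed
        # in the active columns, not only on the complete cases.
        rows = rows_complete_in(X, active)
        beta, *_ = np.linalg.lstsq(X[rows][:, active], y[rows], rcond=None)
    return active, beta

# Toy data (synthetic, for illustration only): columns 0 and 3 are active.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.standard_normal(200)
X[:50, 5] = np.nan    # missing values in an inactive column
X[50:80, 1] = np.nan  # missing values in another inactive column
selected, coef = grafting_sketch(X, y, lam=0.5)
```

In this toy setting only 120 of the 200 rows are complete cases, yet once the two active columns are found the refit can use all 200 rows, because the missing entries sit in columns that were never selected. That is the sense in which the working data matrix grows in both variables and observations.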