Flexible variable selection in the presence of missing data

In many applications, it is of interest to identify a parsimonious set of features, or panel, from multiple candidates that achieves a desired level of performance in predicting a response. This task is often complicated in practice by missing data arising from the sampling design or other random mechanisms. Most recent work on variable selection in missing data contexts relies in some part on a finite-dimensional statistical model, e.g., a generalized or penalized linear model. In cases where this model is misspecified, the selected variables may not all be truly scientifically relevant and can result in panels with suboptimal classification performance. To address this limitation, we propose several nonparametric variable selection algorithms combined with multiple imputation to develop flexible panels in the presence of missing-at-random data. We outline strategies based on the proposed algorithms that achieve control of commonly used error rates. Through simulations, we show that our proposals have good operating characteristics and result in panels with higher classification performance compared to several existing penalized regression approaches in cases where a generalized linear model is misspecified. Finally, we use the proposed methods to develop biomarker panels for separating pancreatic cysts with differing malignancy potential in a setting where complicated missingness in the biomarkers arose due to limited specimen volumes.
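The general idea of pairing multiple imputation with a model-free selection rule can be illustrated with a small sketch. It is not the authors' algorithm, which the abstract does not specify. Everything in it is an assumption made for illustration: the function names, the random draw from observed values used as the imputation step (a crude stand-in that is only appropriate when data are missing completely at random), the single-feature rank-based AUC used as the importance measure, and the rule of keeping features selected in at least half of the imputed datasets.

```python
import random


def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC of a single feature for a 0/1 label."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count positive/negative pairs ordered correctly; ties count one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def impute_once(column, rng):
    """Fill missing entries (None) with random draws from the observed values.

    Assumption: a simple hot-deck stand-in for a proper imputation model.
    """
    observed = [v for v in column if v is not None]
    return [v if v is not None else rng.choice(observed) for v in column]


def select_features(X, y, n_imputations=20, auc_threshold=0.7,
                    min_fraction=0.5, seed=0):
    """Select features that pass the AUC threshold in enough imputed datasets.

    X maps a feature name to a list of values, with None marking missing data.
    The threshold and the voting rule are hypothetical choices for this sketch.
    """
    rng = random.Random(seed)
    counts = {name: 0 for name in X}
    for _ in range(n_imputations):
        for name, column in X.items():
            a = auc(impute_once(column, rng), y)
            # A feature is useful whether it ranks cases high or low.
            if max(a, 1 - a) >= auc_threshold:
                counts[name] += 1
    return [name for name, c in counts.items()
            if c / n_imputations >= min_fraction]
```

Calling `select_features` with a dictionary of candidate biomarker columns and a binary outcome returns the names that are selected consistently across imputations. Pooling selections across imputed datasets, rather than selecting on a single completed dataset, keeps the uncertainty due to missingness visible in the final panel.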
