Flexible variable selection in the presence of missing data

In many applications, it is of interest to identify a parsimonious set of features, or panel, from multiple candidates that achieves a desired level of performance in predicting a response. This task is often complicated in practice by missing data arising from the sampling design or other random mechanisms. Most recent work on variable selection in missing data contexts relies in some part on a finite-dimensional statistical model, e.g., a generalized or penalized linear model. In cases where this model is misspecified, the selected variables may not all be truly scientifically relevant and can result in panels with suboptimal classification performance. To address this limitation, we propose a nonparametric variable selection algorithm combined with multiple imputation to develop flexible panels in the presence of missing-at-random data. We outline strategies based on the proposed algorithm that achieve control of commonly used error rates. Through simulations, we show that our proposal has good operating characteristics and results in panels with higher classification and variable selection performance compared to several existing penalized regression approaches in cases where a generalized linear model is misspecified. Finally, we use the proposed method to develop biomarker panels for separating pancreatic cysts with differing malignancy potential in a setting where complicated missingness in the biomarkers arose due to limited specimen volumes.
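The general idea of pairing multiple imputation with a model-free importance measure can be sketched roughly as follows. This is an illustrative toy, not the authors' algorithm: the imputation step (random draws from observed values), the k-nearest-neighbour learner, the permutation importance measure, the continuous response, and every function name, threshold, and default below are assumptions chosen only to keep the example short and self-contained. In particular, it makes no attempt at the error-rate control the abstract describes.

```python
import numpy as np

def impute_once(X, rng):
    """Fill NaNs in each column with random draws from that column's observed
    values. A deliberately crude stand-in for a proper imputation model;
    assumes every column has at least one observed value."""
    X_imp = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X_imp[:, j])
        if miss.any():
            X_imp[miss, j] = rng.choice(X_imp[~miss, j], size=miss.sum())
    return X_imp

def knn_predict(X_train, y_train, X_test, k=10):
    """Nonparametric k-nearest-neighbour regression (Euclidean distance)."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    idx = np.argsort(d, axis=1)[:, :k]
    return y_train[idx].mean(axis=1)

def permutation_importance(X, y, rng, k=10):
    """Increase in held-out mean squared error when each feature is permuted."""
    n = X.shape[0]
    perm = rng.permutation(n)
    train, test = perm[: n // 2], perm[n // 2:]
    base = np.mean((y[test] - knn_predict(X[train], y[train], X[test], k)) ** 2)
    imp = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        X_perm = X[test].copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        pred = knn_predict(X[train], y[train], X_perm, k)
        imp[j] = np.mean((y[test] - pred) ** 2) - base
    return imp

def select_variables(X, y, n_imputations=5, threshold=0.05, min_freq=0.5, seed=0):
    """Keep features whose importance exceeds `threshold` in at least a
    fraction `min_freq` of the imputed datasets (hypothetical pooling rule)."""
    rng = np.random.default_rng(seed)
    picks = np.zeros(X.shape[1])
    for _ in range(n_imputations):
        X_imp = impute_once(X, rng)
        picks += permutation_importance(X_imp, y, rng) > threshold
    freq = picks / n_imputations
    return np.flatnonzero(freq >= min_freq), freq

def simulate(n=300, p=6, seed=1):
    """Toy data: nonlinear signal in columns 0 and 1; column 2 is missing
    with a probability that depends on the fully observed column 0."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    y = 3 * np.sin(X[:, 0]) + 2 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)
    miss = rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))
    X[miss, 2] = np.nan
    return X, y

if __name__ == "__main__":
    X, y = simulate()
    selected, freq = select_variables(X, y)
    print("selection frequency per feature:", freq)
    print("selected feature indices:", selected)
```

Pooling by selection frequency across imputed datasets is only one possible design choice; a real analysis would use a principled imputation model and a selection rule calibrated to the error rate of interest.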

