The recent proliferation of medical data, such as genetic data and electronic health records (EHR), offers new opportunities to find novel predictors of health outcomes. Given a large set of candidate features, investigators often want to select those most likely to predict an outcome for further study, with the goal of controlling the false discovery rate (FDR) at a specified level. Knockoff filtering is an innovative strategy for FDR-controlled feature selection. However, existing knockoff methods make strong distributional assumptions that hinder their applicability to real-world data. We propose Bayesian models for generating high-quality knockoff copies that utilize available knowledge about the data structure, thus improving the resolution of prognostic features. We consider applications to two types of feature sets: those with categorical and/or continuous variables, possibly with population substructure, as in EHR; and those with microbiome features subject to a compositional constraint and phylogenetic relatedness. Through simulations and real data applications, these methods are shown to identify important features with good FDR control and power.
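To make the selection step of knockoff filtering concrete, the following minimal sketch applies the standard knockoff+ threshold of Barber and Candès to a vector of feature statistics. It is an illustration under stated assumptions, not the Bayesian knockoff generators proposed here: the statistics `W` are assumed to have been computed elsewhere (for example, as the difference in importance between each feature and its knockoff copy), and the function name `knockoff_plus_select` is hypothetical.

```python
import numpy as np


def knockoff_plus_select(W, q=0.1):
    """Return indices of features selected by the knockoff+ threshold.

    W : feature statistics, assumed precomputed; a large positive value
        suggests the original feature outperforms its knockoff copy.
    q : target false discovery rate.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds: distinct nonzero magnitudes of W, ascending.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        # Negative statistics at or beyond -t estimate the number of false
        # positives among the features with W >= t; the +1 is the knockoff+
        # correction.
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            # Smallest threshold meeting the target: select W >= t.
            return np.flatnonzero(W >= t)
    # No threshold achieves the target FDR estimate: select nothing.
    return np.array([], dtype=int)


# Usage with made-up statistics (illustrative values only).
W_example = [5, 4, 3, 2.5, 2, 1.5, -1, 0.5, 3.5, 4.5, 6, 7]
selected = knockoff_plus_select(W_example, q=0.15)
```

The quality of the knockoff copies affects only how `W` is produced; the thresholding rule itself stays the same, which is why better knockoff generation can improve power without changing the selection step.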