Healthcare datasets often contain groups of highly correlated features, such as features from the same biological system. When feature selection is applied to these datasets to identify the most important features, correlated features bias some multivariate feature selectors, which makes it difficult for these methods to distinguish important features from irrelevant ones and can make the selection results unstable. Feature selection ensembles, which aggregate the results of multiple individual base feature selectors, have been investigated as a means of stabilising feature selection results, but they do not address the problem of correlated features. We present a novel framework for creating feature selection ensembles from multivariate feature selectors that accounts for the biases produced by groups of correlated features by applying agglomerative hierarchical clustering in a pre-processing step. We applied these methods to two real-world datasets from studies of Alzheimer's disease (AD), a progressive neurodegenerative disease that has no cure and is not yet fully understood. Our results show a marked improvement in the stability of the selected features over models without clustering, and the selected features are consistent with findings in the AD literature.
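The general idea of clustering correlated features before running an ensemble of multivariate selectors can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the distance measure, linkage, cut-off, base selectors, or aggregation rule, so everything below is an assumption chosen for illustration. The sketch assumes a distance of 1 - |Pearson r|, average linkage, a hypothetical cut-off of 0.3, one medoid-like representative per cluster, a ridge-coefficient ranker as a stand-in base selector, and aggregation by selection frequency over bootstrap resamples. All function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_features(X, threshold=0.3):
    """Agglomerative clustering of features on the distance 1 - |Pearson r|."""
    corr = np.corrcoef(X, rowvar=False)
    dist = np.clip(1.0 - np.abs(corr), 0.0, None)
    dist = (dist + dist.T) / 2.0          # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=threshold, criterion="distance")


def cluster_representatives(X, labels):
    """Per cluster, keep the member with the highest mean |r| to its cluster."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    reps = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        sub = corr[np.ix_(members, members)]
        reps.append(members[np.argmax(sub.mean(axis=1))])
    return np.array(reps)


def ridge_top_k(X, y, k, alpha=1.0):
    """Stand-in multivariate selector: top-k features by |ridge coefficient|.

    Assumes no feature is constant (non-zero standard deviation).
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    p = Xs.shape[1]
    coef = np.linalg.solve(Xs.T @ Xs + alpha * np.eye(p), Xs.T @ yc)
    return np.argsort(-np.abs(coef))[:k]


def ensemble_select(X, y, k=2, n_boot=50, threshold=0.3, seed=0):
    """Cluster first, then aggregate selections over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    labels = cluster_features(X, threshold)
    reps = cluster_representatives(X, labels)
    k = min(k, len(reps))
    counts = np.zeros(len(reps))
    n = X.shape[0]
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # bootstrap sample of the rows
        counts[ridge_top_k(X[idx][:, reps], y[idx], k)] += 1
    order = np.argsort(-counts)[:k]
    return reps[order], counts[order] / n_boot, labels


def make_demo_data(n=200, seed=0):
    """Synthetic data: two correlated groups of three features plus four noise features."""
    rng = np.random.default_rng(seed)
    z1, z2 = rng.normal(size=n), rng.normal(size=n)
    group1 = z1[:, None] + 0.1 * rng.normal(size=(n, 3))
    group2 = z2[:, None] + 0.1 * rng.normal(size=(n, 3))
    noise = rng.normal(size=(n, 4))
    y = 2.0 * z1 - 3.0 * z2 + 0.5 * rng.normal(size=n)
    return np.hstack([group1, group2, noise]), y


X, y = make_demo_data()
selected, frequency, labels = ensemble_select(X, y)
print("cluster labels:", labels)
print("selected features:", selected, "selection frequency:", frequency)
```

Collapsing each cluster to a single representative is only one way to keep a correlated group from splitting importance among its members. Other choices, such as a summary feature per cluster or different base selectors, would fit the same outline.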