Gene expression datasets are usually high-dimensional and therefore require efficient and effective methods for identifying the relative importance of their attributes. Because the search space of possible solutions is so large, feature selection methods based on attribute subset evaluation are often inapplicable, so feature ranking methods are used in these scenarios instead. Most of the feature ranking methods described in the literature are univariate, so they do not detect interactions between attributes. In this paper we propose two new multivariate feature ranking methods, based on pairwise correlation and pairwise consistency, which we have applied to three gene expression classification problems. We show through statistical tests that the proposed methods outperform the state-of-the-art feature ranking methods Clustering Variation, Chi Squared, Correlation, Information Gain, ReliefF and Significance, as well as attribute subset evaluation feature selection methods based on correlation and consistency with a multi-objective evolutionary search strategy.
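To illustrate why a univariate ranking can miss interactions between attributes, the following minimal sketch compares a univariate consistency score with a pairwise one on a toy dataset. It is not the method proposed in the paper: the function names (`consistency`, `univariate_scores`, `pairwise_scores`), the majority-class definition of consistency, and the averaging of pair scores into a per-feature score are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from itertools import combinations, product


def consistency(X, y, cols):
    """Fraction of samples matching the majority class of their group,
    where samples are grouped by their values on the columns in `cols`.
    (Illustrative definition, assumed here; not taken from the paper.)"""
    groups = defaultdict(Counter)
    for row, label in zip(X, y):
        groups[tuple(row[c] for c in cols)][label] += 1
    return sum(max(counts.values()) for counts in groups.values()) / len(y)


def univariate_scores(X, y):
    """Score each feature on its own."""
    return [consistency(X, y, (i,)) for i in range(len(X[0]))]


def pairwise_scores(X, y):
    """Score each feature as the mean consistency of all pairs containing it.
    (Hypothetical aggregation, chosen for simplicity.)"""
    n = len(X[0])
    totals = [0.0] * n
    for i, j in combinations(range(n), 2):
        s = consistency(X, y, (i, j))
        totals[i] += s
        totals[j] += s
    return [t / (n - 1) for t in totals]


# Toy data: every combination of three binary features.
# The class is the XOR of features 0 and 1; feature 2 is irrelevant.
X = [list(bits) for bits in product([0, 1], repeat=3)]
y = [row[0] ^ row[1] for row in X]

print("univariate:", univariate_scores(X, y))
print("pairwise:  ", pairwise_scores(X, y))
```

On this toy data the univariate scores are identical for all three features, whereas the pairwise scores place features 0 and 1 above the irrelevant feature 2, because only the pair (0, 1) determines the class.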