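Project title: Research on the Stability of Feature Selection for High-Dimensional Data
Project number: No.61202144
Project type: Young Scientists Fund project
Year of approval: 2013
Discipline: Computer Science
Project author: Yang Fan (杨帆)
Affiliation: Xiamen University (厦门大学)
Funding: 230,000 CNY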
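Abstract (translated from Chinese): The stability of high-dimensional feature selection is an important yet unsolved problem. Existing feature selection research focuses mainly on the prediction accuracy and computational efficiency of the learning machine, and evaluates feature selection results by accuracy and related metrics. In high-dimensional data spaces, however, small changes in the training set can make feature selection results unstable and unreliable. Taking gene expression data as its object of study, this project starts from the distributional characteristics of high-dimensional data spaces and of gene expression data, and analyzes the possible sources of instability in high-dimensional feature selection in order to improve its stability and reliability. The research covers: establishing stability measures for feature selection by analyzing the characteristics of high-dimensional data distributions; studying feature evaluation criteria based on the objective function through an analysis of classical feature selection algorithms; proposing a feature selection strategy based on latent variable models that takes the correlation among features into account; designing a decomposition-based feature selection method for multiclass problems that addresses the locality of the data distribution; and, further considering the diversity of within-class distributions, proposing a recursive local feature selection strategy of the form "clustering-feature selection". The results of this project will improve the stability of high-dimensional feature selection and will be applied to gene selection, gene regulatory networks, and the discovery of cancer subtypes.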
Keywords (translated from Chinese): high-dimensional data; feature selection; stability; gene expression data
English abstract: The stability of feature selection on high-dimensional data is an important yet under-addressed issue. Existing feature selection methods focus on improving classifier performance, such as prediction accuracy and computational efficiency, and use these metrics to evaluate the quality of the feature subsets that feature selection algorithms produce. Unfortunately, the results of feature selection algorithms can be unstable and unreliable in high-dimensional spaces because they are very sensitive to small variations in the data. To improve the stability and reliability of feature selection algorithms, this project analyzes the major causes of the instability by investigating the distribution of gene expression data in a high-dimensional space. The research contributions of this project include: (1) A new stability measure for feature selection is proposed based on the characteristics of high-dimensional data distributions; (2) A feature evaluation criterion based on the classification objective function is presented through the analysis of classical feature selection algorithms; (3) A feature selection algorithm based on a hidden variable model is proposed by taking into account the correlation between features; (4) A decomposition-based feature selection method for multiclass classification is desig
English keywords: high-dimensional data; feature selection; stability; gene expression data