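Project title: Research on PLS Feature Selection Methods for High-Dimensional Biological Data
Project number: No.61473329
Project type: General Program
Approval year: 2015
Discipline: Other
Principal investigator: You Wenjie (游文杰)
Institution: Fujian Normal University (福建师范大学)
Funding: CNY 570,000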
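Abstract (translated from Chinese): Drawing on theory and methods from statistical computing and machine learning, this project studies partial least squares (PLS) feature selection models and algorithms for biological data that are high-dimensional, small-sample, noisy, strongly correlated, and multi-class. It presents a multi-feature selection algorithm that accounts for interaction effects, to screen informative features with small main effects but strong interaction effects; introduces a multi-feature selection algorithm with a recursive feature elimination strategy, to improve the consistency and compactness of the selected subset; presents a multi-perturbation ensemble feature selection method, to strengthen the robustness of the selected feature subset; proposes a feature-level information fusion framework combining feature selection and dimensionality reduction, to mine the latent structure of high-dimensional data; and develops computational analysis tools. The algorithms are applied to genome-wide tumor gene expression analysis to screen tumor-specific expressed genes and to extract expression patterns and co-regulated genes, helping biologists understand and explain the mechanisms of tumor-specific gene expression and effectively supporting biological experiments. The research plan should strengthen the study of processing methods for high-dimensional, small-sample, multi-class biological data, advance biological information processing and the understanding of its frontier problems, and is of both informatics and biological significance for research that combines data mining methods with biology.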
Keywords (translated from Chinese): feature extraction;feature selection;discriminant analysis;data mining
English abstract: This project addresses data that are high-dimensional with small sample sizes (HDSS), noisy, strongly correlated, and multi-class, and focuses on models and algorithms for feature reduction based on the theory and methods of statistical computing and machine learning. We present a multi-feature selection method that takes into account the combined effects of all features and the correlations among them, indirectly considers the joint distribution of the features, and effectively detects features with a relatively small main effect but a strong interaction effect. We present a novel multi-feature selection method based on a recursive feature elimination strategy, which improves the consistency of the selected feature subset and makes it more compact. We present a multi-perturbation ensemble feature selection method, which improves the effectiveness of the selected feature subset on small-sample data. We propose a novel method that implements information fusion of feature selection and feature extraction in a unified framework. It can effectively improve the generalization ability of the learner and enhance the interpretability and understandability of the recognition results. Moreover, our algorithm is computationally efficient, especially for high-dimensional datasets, and it can be applied to both two-category and multi-category classification problems without limitation. Further, our methodology is applied to genome-wide tumor gene expression analysis, focusing on identifying tumor-specific expressed genes and extracting co-regulated genes. This work will help biologists understand and explain the mechanism of tumor-specific gene expression and will effectively support biological experiments. The results are expected to be of interest to cancer biologists and to provide a new research paradigm for studies of other complex traits or diseases under multiple conditions.
Our models and algorithms are also applied to other studies in biological information processing, to achieve efficient feature selection and to assist biological experiments. Our research will help promote biological information processing and accelerate the understanding of its frontier issues. It can provide a theoretical basis and practical computational methods for solving complex computational problems on HDSS data.
English keywords: feature extraction;feature selection;discriminant analysis;data mining