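Project title: Classification and Prediction of Protein Function Types with Imbalanced Multi-Label Characteristics
Project number: No.61462047
Project type: Regional Science Fund Project
Year of approval: 2015
Discipline: Computer Science
Project author: Lin Weizhong (林卫中)
Affiliation: Jingdezhen Ceramic University (景德镇陶瓷大学)
Funding: 450,000 CNY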
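Chinese abstract: In the life sciences, proteins that carry several functional attributes at once often play a more important role in life processes, yet the number of such protein samples in biological databases is extremely unevenly distributed. This situation urgently calls for new prediction models that can handle protein function prediction on imbalanced multi-label datasets. This project will study protein feature representation, constructing two new representation models: one that extracts the network and hierarchical features of a protein's Gene Ontology, and one that captures protein evolutionary features based on the position-specific scoring matrix. It will build an information database of subcellular localization and antimicrobial peptide function types, and provide the corresponding data retrieval and analysis tools. It will study classification algorithms for imbalanced multi-label data, applying grey relational spatial proximity analysis to construct a sample resampling technique for imbalanced multi-label datasets and using class-attribute weighting to combine the outputs of multiple classifiers. Finally, it will develop multi-label protein function prediction tools, such as tools for multi-label subcellular localization and antimicrobial peptide function prediction. The project will help advance biological information mining technology and pattern recognition research on imbalanced multi-label datasets.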
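Chinese keywords: Protein function prediction;Imbalanced multi-label classification;Grey resampling of multi-label data;Class-attribute-weighted multi-classifier ensemble;Sequence feature representation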
English abstract: In the life sciences, proteins that have multiple functional properties often play key roles in biological processes. However, such proteins are unevenly distributed in biological databases, so novel models are urgently needed to classify multi-label proteins on imbalanced datasets. First, we will study how to represent proteins by constructing two models. One extracts the network and hierarchical structure features of a protein's GO (Gene Ontology) annotations. The other extracts evolutionary information from the PSSM (Position-Specific Scoring Matrix). Second, we will construct a database of subcellular localization and antimicrobial peptide function types, and provide the corresponding search and analysis tools. Third, we will study classification algorithms for imbalanced multi-label datasets: we will resample imbalanced multi-label datasets using grey incidence analysis and combine the outputs of multiple classifiers by weighting class labels. Finally, we will develop tools for multi-label prediction of protein subcellular localization and antimicrobial peptide function. This research will promote the development of biological information mining technology and of pattern recognition techniques for imbalanced multi-label datasets.
English keywords: Protein function prediction;Imbalanced multi-label classification;Grey resampling in multi-label datasets;Classifier ensemble by weighting class labels;Protein feature representation