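Project title: Parallel Semi-Supervised Learning Algorithms for Peptide Identification from Tandem Mass Spectrometry Data
Project number: No.61503412
Project type: Young Scientists Fund
Year of approval: 2016
Discipline: Other
Project author: Liang Xijun (梁锡军)
Affiliation: China University of Petroleum (East China)
Project funding: 180,000 yuan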
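Abstract (Chinese original, translated): The complexity of protein samples and biological experiments makes mass spectra noisy, so the peptide-spectrum matches (PSMs) produced by search engines such as SEQUEST contain a large number of false-negative identifications. Post-processing the search results with machine learning methods is currently a key step in protein identification. Existing methods have high computational complexity, cannot process large-scale experimental datasets efficiently, and do not make full use of the rich multi-source heterogeneous information accumulated in biological experiments, which constrains the development of proteome identification technology. Designing parallel algorithms for peptide identification under an appropriate mathematical model, and fusing multi-source heterogeneous information, is the key to high-throughput, high-accuracy protein identification. Semi-supervised learning models generalize well and are a powerful mathematical model for peptide identification. Taking peptide identification as the application background, this project will study parallel algorithms for solving semi-supervised learning models, including: constructing sample reduction rules for semi-supervised learning models; building efficient parallel algorithms for these models within the alternate convex search framework; establishing a unified learning framework for multi-source heterogeneous information fusion and peptide identification; and evaluating the algorithms on laboratory datasets. The research is significant for improving mass-spectrometry-based identification and advancing biotechnology.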
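Keywords (Chinese original, translated): semi-supervised learning; protein identification; parallel algorithms; multi-source data fusion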
Abstract (English): The complexity of biological samples and experimental processes introduces substantial noise into mass spectra, resulting in a large number of incorrect peptide-spectrum matches (PSMs) in SEQUEST's search results. Post-processing these results with machine learning methods is a critical step in protein identification. Current methods have high computational complexity and cannot process large-scale datasets efficiently. The rich multi-source heterogeneous information accumulated in biological experiments is not sufficiently utilized either. These factors have hampered the progress of protein identification. Designing dedicated parallel algorithms and heterogeneous multi-source data fusion methods under an appropriate mathematical framework is the key to achieving high accuracy on high-throughput MS/MS platforms. Semi-supervised learning models have strong generalization ability and are a powerful mathematical model for peptide identification. This project intends to study parallel algorithms for semi-supervised learning models applied to peptide identification. The main research issues include: building sample reduction rules for semi-supervised learning models; employing the alternate convex search framework to establish an efficient parallel algorithm; establishing a unified framework for multi-source heterogeneous data fusion and peptide identification; and verifying the performance of the algorithms on laboratory datasets. This project is valuable for improving the identification accuracy of mass spectrometry techniques and promoting the progress of biotechnology.
Keywords (English): semi-supervised learning; protein identification; parallel computation; multi-source data fusion