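Research on Parallel Semi-Supervised Learning Algorithms for Peptide Identification from Tandem Mass Spectrometry Data

Project title: Research on Parallel Semi-Supervised Learning Algorithms for Peptide Identification from Tandem Mass Spectrometry Data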

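Project number: No.61503412

Project type: Young Scientists Fund project

Year of approval: 2016

Discipline: Other

Project author: Liang Xijun (梁锡军)

Affiliation: China University of Petroleum (East China)

Funding: CNY 180,000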

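Abstract (translated from Chinese): The complexity of protein samples and biological experiments makes mass spectra noisy, so the peptide-spectrum matches produced by search engines such as SEQUEST contain a large number of false-negative identifications. Post-processing these results with machine learning methods is a key step in current protein identification. Existing methods have high computational complexity, cannot efficiently process large-scale experimental datasets, and do not make full use of the rich multi-source heterogeneous information accumulated in biological experiments, all of which hold back the development of proteome identification techniques. Designing parallel algorithms for peptide identification under an appropriate mathematical model, and fusing multi-source heterogeneous information into that model, is the key to high-throughput, high-accuracy protein identification. Semi-supervised learning models generalize well and are a powerful mathematical model for peptide identification. With peptide identification as the application setting, this project will study parallel algorithms for solving semi-supervised learning models, including: constructing sample reduction rules for semi-supervised learning models; building efficient parallel algorithms for these models within an alternating convex search framework; establishing a unified learning framework for multi-source heterogeneous information fusion and peptide identification; and evaluating the algorithms on laboratory datasets. The research is of significant value for improving mass spectrometry identification and advancing biotechnology.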

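Keywords (translated from Chinese): semi-supervised learning; protein identification; parallel algorithms; multi-source data fusion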

English abstract: The complexity of biological samples and experimental processes makes mass spectra noisy, resulting in a large number of incorrect peptide-spectrum matches (PSMs) in SEQUEST’s search results. Post-processing the search results with machine learning methods is a critical step in protein identification. Current methods have high computational complexity and cannot process large-scale datasets efficiently. The rich multi-source heterogeneous information accumulated in biological experiments is not sufficiently utilized either. These factors have hampered the progress of protein identification. Designing dedicated parallel algorithms and heterogeneous multi-source data fusion methods under an appropriate mathematical framework is the key to achieving high accuracy on high-throughput MS/MS platforms. Semi-supervised learning models have strong generalization ability and are a powerful mathematical model for peptide identification. This project intends to study parallel algorithms for semi-supervised learning models in peptide identification. The main research tasks are: building sample reduction rules for semi-supervised learning models; employing the alternating convex search framework to establish an efficient parallel algorithm; establishing a unified framework for multi-source heterogeneous data fusion and peptide identification; and verifying the performance of the algorithms on laboratory datasets. This project is valuable for improving the identification accuracy of mass spectrometry techniques and promoting the progress of biotechnology.

English keywords: semi-supervised learning; protein identification; parallel computation; multi-source data fusion


Semi-supervised learning


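Semi-supervised learning (SSL) is a central research topic in pattern recognition and machine learning. It combines supervised and unsupervised learning: a model is trained on a large amount of unlabeled data together with labeled data. Because it reduces the manual labeling effort required while still achieving relatively high accuracy, semi-supervised learning is receiving growing attention.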
