Spectral Library Design and Theoretical Spectrum Prediction Based on the Spectra Dictionary Concept

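Project title: Spectral Library Design and Theoretical Spectrum Prediction Based on the Spectra Dictionary Concept

Project number: No.31270834

Project type: General Program (面上项目)

Approval year: 2013

Discipline: Biological sciences

Principal investigator: 孙世伟

Affiliation: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)

Funding: 700,000 CNY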

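Abstract (translated from Chinese): Sequence identification based on mass spectrometry is an important tool in proteomics. Existing sequence database search and de novo techniques are limited by the accuracy of theoretical spectrum prediction. Spectral library techniques can in principle avoid the difficulty of theoretical spectrum prediction, but the limited number of spectra held in a library often causes queries to fail. This project adopts a "spectra dictionary" strategy to overcome these difficulties. First, we study the conservation of fragmentation patterns of peptide segments and collect the segments with conserved fragmentation patterns in the spectra dictionary. Second, for peptide segments not in the dictionary, we build a statistical model based on the "mobile proton" hypothesis to predict their local theoretical spectra. Finally, for a query spectrum, we first annotate its candidate peptide segments according to the fragmentation patterns in the dictionary, and then combine the annotations into a complete peptide. The advantage of this strategy is that even if the query spectrum as a whole is not in the spectral library, its local spectra may still be present in the peptide segment dictionary. Preliminary results show that short peptide segments have strongly conserved fragmentation patterns; that a small peptide segment dictionary can annotate the vast majority of spectra; and that the statistical model can predict the theoretical spectra of peptide segments with high accuracy. This research will help improve the accuracy of spectral library methods and extend their range of application.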
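
The annotate-then-combine idea in the abstract can be illustrated with a minimal Python sketch. It is not the project's method: the abstract describes matching fragmentation patterns (which involve peak intensities) and a statistical model, whereas this toy version only looks for the residue-mass ladder of each dictionary segment among the peak m/z values. The function names `b_ions`, `y_ions`, and `annotate`, the tolerance of 0.02 Da, and the example peptide and segments are all assumptions made for illustration. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)


def b_ions(peptide):
    """Singly charged b-ion m/z values b1 .. b(n-1) (N-terminal fragments)."""
    total, out = 0.0, []
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        out.append(total + PROTON)
    return out


def y_ions(peptide):
    """Singly charged y-ion m/z values y1 .. y(n-1) (C-terminal fragments)."""
    total, out = 0.0, []
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        out.append(total + WATER + PROTON)
    return out


def annotate(peaks, segments, tol=0.02):
    """Hypothetical annotation step: report (segment, start_mz) whenever the
    segment's residue-mass ladder, starting at peak start_mz, is fully present
    in the peak list within tol. Intensities are ignored in this toy version."""
    hits = []
    for seg in segments:
        for start in peaks:
            mz, found = start, True
            for aa in seg:
                mz += RESIDUE_MASS[aa]
                if not any(abs(p - mz) <= tol for p in peaks):
                    found = False
                    break
            if found:
                hits.append((seg, start))
    return hits


if __name__ == "__main__":
    # Toy query: the ideal b-ion series of an example peptide.
    peaks = b_ions("PEPTIDE")
    # Toy dictionary of two segments; only "EPT" has its ladder in the peaks.
    print(annotate(peaks, ["EPT", "GG"]))
```

In this sketch a segment can be annotated even when the full peptide is unknown, which mirrors the stated advantage of the dictionary strategy; combining overlapping annotations into a full-length peptide is left out.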

Chinese keywords (translated): proteomics; mass spectrometry; big data; theoretical spectrum

English abstract: Tandem mass spectrometry has emerged as one of the most effective techniques for protein sequence identification. Both database searching and de novo sequencing suffer from the low accuracy of theoretical spectrum prediction. In principle, the spectral library approach avoids this difficulty; however, the limited number of known spectra in a spectral library often leads to failure when searching for a query spectrum. This study aims to circumvent the difficulty with a "spectra dictionary" technique. Specifically, we first investigate the conservation of fragmentation patterns of peptide segments, and gather the segments with conserved fragmentation patterns to build a spectra dictionary. A segment may still be absent from the dictionary; for such segments, a statistical model is proposed to predict their fragmentation patterns according to the "mobile proton" hypothesis. Finally, the query spectrum is annotated with peptide segment candidates by searching the dictionary, and the full-length peptide sequence is assembled from these segment candidates. Preliminary experimental results suggest that: 1) peptide segments usually demonstrate conserved fragmentation patterns; 2) nearly all spectra can be explained even using a small

English keywords: proteomics; mass spectrometry; big data; theoretical spectrum
