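Project title: Research on Scalable Big Data Storage and Analysis Models for Proteomics
Project number: No.61501071
Project type: Young Scientists Fund project
Year of approval: 2016
Project discipline: Radio electronics, telecommunications technology
Project author: Bai Mingze (白明泽)
Author affiliation: Chongqing University of Posts and Telecommunications (重庆邮电大学)
Project funding: 190,000 CNY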
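Chinese abstract (translated): With the rapid development of high-throughput mass spectrometry sequencing, proteomics data are growing quickly in volume and diversifying in format, showing clear big data characteristics. The recently launched “China Human Proteome Project” will also soon produce big data with a profound impact on China's biological and medical communities, and efficient biological knowledge discovery platforms are urgently needed to interpret these data. Studying storage and analysis models for proteomics data based on big data principles therefore has important theoretical significance and broad application prospects. This project aims to study scalable storage, management, and analysis models for proteomics big data: (1) the characteristics of proteomics data and a layered “conceptual layer, logical abstraction layer, physical layer” data model that is scalable in both datasets and functionality; (2) the “extract-transform-load” (ETL) process for aggregating and integrating multiple data sources; (3) the core analysis algorithm built on the big data storage model, namely an efficient sequence search algorithm. Through this research, the project will provide effective big data management and analysis mechanisms for proteomics and promote the further development of China's biological and information industries.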
Chinese keywords (translated): proteomics data; big data; storage model; sequence search algorithm; ETL
English abstract: With the rapid development of high-throughput mass spectrometry, the volume of proteomics data is increasing sharply and data formats vary widely, indicating that proteomics data have already become Big Data. The recently launched “China Human Proteome Project” will produce a large amount of data that could significantly influence Chinese biology and medical science. High-performance biological knowledge discovery platforms based on proteomics data are urgently needed; investigating Big Data storage and analysis models for proteomics data therefore has high research significance and broad application prospects. This proposal plans to study scalable Big Data storage, management, and analysis models for proteomics data: (1) the characteristics of proteomics data and a multilayer data storage model (“concept layer, logical abstraction layer, and physical layer”) that is scalable in data volume and function; (2) the “Extraction, Transformation, and Loading (ETL)” process for integrating various proteomics data sources; (3) a high-performance sequence search algorithm for peptide and protein identification and quantification, which is a core analysis algorithm. Through these studies, effective mechanisms for proteomics Big Data management and analysis can be established, promoting the further development of China’s biological and information industries.
English keywords: Proteomics Data; Big Data; Storage Model; Sequence Search Algorithm; ETL