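Project title: Statistical Analysis of Big Data in the Study of Anti-tumor Target Spectra of Natural Bioactive Small Molecules
Project number: No.11461079
Project type: Regional Science Fund Project
Year of approval: 2015
Project discipline: Mathematical and physical sciences and chemistry
Project author: 潘蓄林
Author affiliation: Yunnan University (云南大学)
Project funding: CNY 360,000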
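Chinese abstract: With the rapid development of chemistry and biotechnology, anti-tumor data have become fast and cheap to obtain, bringing research on the anti-tumor target spectrum of natural bioactive small molecules into an era of statistical modeling and data analysis. Traditional statistical methods, however, produce substantial bias when applied to the large volumes of heterogeneous data, high-dimensional data, missing data, complex data, measurement errors, outliers, and dependent and high-dimensional discrete data that arise in this research. Focusing on the big data generated by research on the anti-tumor target spectrum of natural bioactive small molecules, this project applies statistical theory and methods to four problems. First, it studies statistical methods for complex data from different sources, outliers, and measurement errors, in order to reduce systematic error. Second, it studies the inconsistency of the eigenvalues of the sample covariance matrix and the inverse covariance matrix for high-dimensional or ultrahigh-dimensional data, along with bias-correction methods, and constructs the network structure of the target spectrum. Third, it studies variable selection methods for high-dimensional nonparametric and semiparametric models under sparsity or approximate sparsity, examines the key variables in gene pathways, activity indicators, protein pathways, and metabolic pathways, and selects target molecules. Fourth, it studies how to control the false discovery proportion of large-scale statistical tests under non-independence, in order to control the accuracy of the tests. The project will advance the research and development of statistical models for complex biological data.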
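Chinese keywords: big data;anti-tumor target spectrum;sparsity;ultrahigh-dimensional variable selection;large covariance matrix estimation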
English abstract: With the development of chemical and biological technology, a deluge of data has been generated as a result of falling costs and faster data-acquisition methods. Research on the natural bioactive molecule anti-tumor target spectrum (NBMTS) is therefore experiencing an explosion of data and entering the era of statistical modeling and data analysis. Many traditional statistical methods that perform well for moderate sample sizes are not accurate for NBMTS data, which include data from many sources, high-dimensional data, missing data, complex data, measurement errors, weak variable correlation, and high-dimensional discrete data. Our project focuses on statistical methods for the 'big data' arising from NBMTS research. We will study methods for removing systematic biases and the best normalization practice for data aggregated from numerous sources, dependent data, missing data, and outliers. We will also build the networks of NBMTS by handling the inconsistency of the high-dimensional sample covariance matrix and correcting the bias in its eigenvalues. Variable selection methods for nonparametric and semiparametric statistical models will be introduced by exploiting the sparsity or quasi-sparsity assumption, an essential concept in modern statistical methods for high-dimensional data. Target molecules will be selected from the gene pathway, bioactivity indicators, protein pathway, and metabolic pathway using the methods we propose. Methods for false discovery control in large-scale simultaneous tests under dependence will be studied in detail, in order to control the accuracy of the statistical tests. Our project will advance the research and development of models for complex biological data.
English keywords: Big data;Anti-tumor target spectrum;sparsity;ultrahigh-dimensional variable selection;Estimating large covariance matrix