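Statistical Analysis of Big Data in Research on the Anti-Tumor Target Spectrum of Natural Bioactive Small Molecules

Project title: Statistical Analysis of Big Data in Research on the Anti-Tumor Target Spectrum of Natural Bioactive Small Molecules

Project number: No.11461079

Project type: Regional Science Fund Project

Year of approval: 2015

Discipline: Mathematical and Physical Sciences and Chemistry

Principal investigator: Pan Xulin (潘蓄林)

Institution: Yunnan University (云南大学)

Funding: 360,000 yuan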

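Abstract (translated from the Chinese): With the rapid development of chemical and biological technology, anti-tumor data have become fast and cheap to obtain, bringing research on the anti-tumor target spectrum of natural bioactive small molecules into the era of statistical modeling and data analysis. Traditional statistical methods, however, produce substantial bias when they are applied to the large volumes of heterogeneous multi-source data, high-dimensional data, missing data, complex data, measurement errors, outliers, and dependent and high-dimensional discrete data that arise in this research. This project focuses on the big data generated in research on the anti-tumor target spectrum of natural bioactive small molecules. Using statistical theory and methods, it will study statistical methods for complex data from different sources, outliers, and measurement error in order to reduce systematic error; study the inconsistency of the eigenvalues of sample covariance and inverse covariance matrices for high-dimensional or ultrahigh-dimensional data, together with bias-correction methods, and construct the network structure of the target spectrum; study variable selection methods for high-dimensional nonparametric and semiparametric models under sparsity or approximate sparsity, examining the key variables in gene pathways, activity indicators, protein pathways, and metabolic pathways in order to select target molecules; and study how to control the false discovery proportion of large-scale statistical tests under dependence, thereby controlling the accuracy of the tests. The project will advance the research and development of statistical models for complex biological data.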

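Keywords (translated from the Chinese): big data; anti-tumor target spectrum; sparsity; ultrahigh-dimensional variable selection; large covariance matrix estimation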
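The abstract's point about the inconsistency of sample covariance eigenvalues in high dimensions can be illustrated with a small simulation. The sketch below is not the project's method. It assumes data with a true identity covariance (every true eigenvalue equals 1) and compares the average and the largest eigenvalue of the sample covariance matrix. The function names, the dimensions, and the use of power iteration are illustrative choices made here, and only the Python standard library is used.

```python
import math
import random


def simulate_identity_data(n, p, seed=0):
    """Draw n rows of p independent standard normal values (true covariance = identity)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]


def sample_covariance(rows):
    """Centered sample covariance matrix (divisor n - 1) of a list of rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def largest_eigenvalue(matrix, iterations=300):
    """Approximate the top eigenvalue of a symmetric PSD matrix by power iteration."""
    p = len(matrix)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero matrix: every eigenvalue is 0
            return 0.0
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v of the final unit vector.
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    return sum(v[i] * mv[i] for i in range(p))


if __name__ == "__main__":
    p = 40
    for n in (40, 1000):  # p comparable to n, then p much smaller than n
        cov = sample_covariance(simulate_identity_data(n, p, seed=0))
        average = sum(cov[i][i] for i in range(p)) / p  # trace / p = mean eigenvalue
        print(f"n={n:5d} p={p}: mean eigenvalue {average:.2f}, "
              f"largest eigenvalue {largest_eigenvalue(cov):.2f}")
```

When the dimension p is comparable to the sample size n, the largest sample eigenvalue sits well above the true value of 1 even though the average eigenvalue stays close to 1. In addition, the centered sample covariance matrix has rank at most n - 1, so it is singular whenever p is at least n and cannot be inverted directly. These are the kinds of distortions that bias-correction and regularized estimation methods aim to address.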

English abstract: With the development of chemical and biological technology, falling costs and faster data-acquisition methods have generated a deluge of data. Research on the natural bioactive molecule anti-tumor target spectrum (NBMTS) is therefore experiencing an explosion of data and entering the era of statistical modeling and data analysis. Many traditional statistical methods that perform well for moderate sample sizes are not accurate for NBMTS data, which include data from many sources, high-dimensional data, missing data, complex data, measurement errors, weakly correlated variables, and high-dimensional discrete data. Our project focuses on statistical methods for 'big data' from NBMTS research. It studies methods for removing systematic biases and the best normalization practice for data aggregated from numerous sources, dependent data, missing data, and outliers. We will also build NBMTS networks by handling the inconsistency of the high-dimensional sample covariance matrix and correcting the bias of its eigenvalues. Variable selection methods for nonparametric and semiparametric statistical models will be introduced by exploiting sparsity or quasi-sparsity assumptions, an essential concept in modern statistical methods for high-dimensional data. Target molecules will be selected from gene pathways, bioactivity indicators, protein pathways, and metabolic pathways using the proposed methods. False discovery control for large-scale simultaneous tests under dependence will be studied in detail in order to control the accuracy of the statistical tests. Our project will advance the research and development of models for complex biological data.

English keywords: Big data; Anti-tumor target spectrum; Sparsity; Ultrahigh-dimensional variable selection; Large covariance matrix estimation
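The abstract's final aim, controlling false discoveries across large numbers of simultaneous tests, can be made concrete with a standard baseline. The sketch below implements the Benjamini-Hochberg step-up procedure, with an optional Benjamini-Yekutieli correction that remains valid under arbitrary dependence at the cost of power. It is a textbook reference point, not the dependence-aware method the project proposes to develop; the function name, the `arbitrary_dependence` flag, and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05, arbitrary_dependence=False):
    """Return a list of booleans marking which hypotheses are rejected.

    Step-up rule: find the largest rank k (1-based, p-values sorted ascending)
    with p_(k) <= k * alpha / (m * c), then reject the k smallest p-values.
    c = 1 gives Benjamini-Hochberg; c = sum_{i=1..m} 1/i gives the more
    conservative Benjamini-Yekutieli variant for arbitrary dependence.
    """
    m = len(p_values)
    if m == 0:
        return []
    c = sum(1.0 / i for i in range(1, m + 1)) if arbitrary_dependence else 1.0
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c):
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values, e.g. one test per candidate target molecule.
    p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(p_values, alpha=0.05))
    print(benjamini_hochberg(p_values, alpha=0.05, arbitrary_dependence=True))
```

The plain Benjamini-Hochberg guarantee is established for independent tests and for certain forms of positive dependence. Tests on variables within the same pathway are typically correlated, which is why the abstract singles out the dependent case for separate study.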
