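Project title: Manifold Modeling of High-Throughput Sequencing Data for Evolutionary Genomics
Project number: No. 11471313
Project type: General Program
Year of approval: 2015
Discipline: Mathematical and Physical Sciences and Chemistry
Principal investigator: Cai Yunpeng (蔡云鹏)
Affiliation: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Funding: CNY 700,000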
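Abstract (translated from Chinese): Evolutionary genomics analysis is one of the active areas of computational biology and a principal means of revealing, at the molecular level, how organisms adapt to their environment. This project proposes introducing manifold modeling into evolutionary genomics analysis in order to depict the details of gene evolution accurately: correctly extracting and presenting the topological structure and clustering relationships of gene sequence data in space, and accurately recovering the gene evolutionary paths reflected in high-throughput sequencing data. Building on the applicant's previous work, the project proposes a fast method for constructing k-nearest-neighbor lists in non-Euclidean space using pseudo-metric space partitioning trees, together with a detail-preserving sampling method based on topological similarity, to address the computational burden of applying manifold modeling to large-scale data analysis. Furthermore, ideas from constrained clustering and subspace clustering are introduced into manifold modeling: a small number of labeled samples are used to determine a manifold subspace with a specific classification meaning, to distinguish accurately between the effective dimensions and the interfering dimensions of the manifold space, to find the natural classification boundaries of the data set in the manifold space, and to recover accurately the structural details of gene evolutionary paths. The project aims to provide a powerful analysis tool for genomics research and, at the same time, ideas for developing new manifold modeling methods.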
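Keywords (translated from Chinese): manifold learning;manifold clustering;computational biology;evolutionary genomics;high-throughput sequencing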
English abstract: Evolutionary genomics is an active topic in computational biology that focuses on discovering, at the molecular level, the mechanisms by which living organisms adapt to environmental changes. In this project we propose adopting manifold modeling to depict the details of gene evolution precisely, to explore correctly the topological structure and clustering of gene sequencing data in sequence space, and to recover accurately the gene evolutionary paths concealed in next-generation sequencing data. Building on our previous work, we propose an efficient method for constructing k-nearest-neighbor lists for data in non-Euclidean space with the aid of a pseudo-metric space partitioning tree, and a detail-preserving sampling method based on topological similarity; together these address the computational difficulties of applying manifold learning to large-scale data. Moreover, we introduce the concepts of constrained clustering and subspace clustering into manifold modeling, using a small number of labelled data to determine a manifold subspace that reflects a specifically defined classification significance. In this way the dimensions of the manifold space that arise from meaningful variation are accurately distinguished from those that arise from random noise. Constrained clustering is also used to explore the natural borders of the data set in the manifold subspace, as well as the structural details of the gene evolutionary paths. The project is expected to yield a powerful tool for genomics analysis and to provide insights for the development of new manifold modeling methods.
English keywords: manifold learning;manifold clustering;computational biology;evolutionary genomics;next-generation sequencing