生物序列大数据集模体发现算法的研究

项目名称： 生物序列大数据集模体发现算法的研究

项目编号： No.61502366

项目类型： 青年科学基金项目

立项/批准年度： 2016

项目学科： 计算机科学学科

项目作者： 于强

作者单位： 西安电子科技大学

项目金额： 21万元

中文摘要： 模体发现对生物序列中定位有意义的序列片断起着非常重要的作用。模体发现的精确算法能在指定测度下找出输入序列中最优的模体，但近年来生物序列大数据集为精确算法在时间性能方面提出了新的挑战。本项目采用基于模式驱动的技术路线进行模体发现，以设计时间高效的模体发现的精确算法为基本目标，并分别从减少候选模体和加速候选模体验证两个角度确保算法具有高效的时间性能。首先，建立在大的序列数据集中选择参考序列的方法，使选出的参考序列在所有可能的参考序列中对应最小数量的候选模体；其次，建立占用存储空间小的由三个子串生成候选模体的方法，使得进一步减小候选模体的数量；最后，设计时间高效的模体验证算法，使在长度为n的序列上验证长度为l的候选模体的时间性能达到O(nl/log(n))。本项目所设计的精确算法将能够在含有数百条甚至更多序列的生物序列大数据集中快速地进行模体发现。

中文关键词： 生物序列；大数据集；模体发现；精确算法；模式驱动

英文摘要： Motif discovery plays an important role in locating significant sequence segments in biological sequences. The exact algorithms for motif discovery can report the optimal motif under a specified measure. In recent years, the large biological sequence datasets bring new challenges to the exact algorithms in the aspect of time performance. Aimed at designing efficient exact algorithms for motif discovery as the basic goal, we identify motifs based on pattern-driven, and guarantee good time performance by reducing candidate motifs and accelerating motif verification. At first, we establish the method for selecting reference sequences from large datasets, such that the selected reference sequences correspond to the minimum number of candidate motifs in all possible reference sequences. Secondly, we establish the method for generating candidate motifs from three substrings, which needs little storage space and further reduces candidate motifs. Finally, we design the efficient algorithm for motif verification, achieving the time performance of O(nl/log(n)) for verifying a candidate motif of length l in a sequence of length n. The exact algorithms designed by this project will identify motifs efficiently in large biological sequence datasets that contain hundreds or more sequences.

英文关键词： Biological sequences;Large datasets;Motif discovery;Exact algorithms;Pattern-driven

成为VIP会员查看完整内容