Genome assembly, the process of reconstructing a long genetic sequence by aligning and merging short fragments, or reads, is known to be NP-hard, either as a version of the shortest common superstring problem or in a Hamiltonian-cycle formulation. That is, the computing time is believed to grow exponentially with the problem size in the worst case. Despite this fact, high-throughput technologies and modern algorithms currently allow bioinformaticians to produce and assemble datasets of billions of reads. Using methods from statistical mechanics, we address this conundrum by demonstrating the existence of a phase transition in the computational complexity of the problem and showing that practical instances always fall in the `easy' phase (solvable by polynomial-time algorithms). In addition, we propose a Markov-chain Monte Carlo method that outperforms common deterministic algorithms in the hard regime.
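To make the shortest-common-superstring formulation mentioned above concrete, here is a minimal sketch of the classic greedy merge heuristic: repeatedly merge the pair of reads with the largest suffix-prefix overlap until a single string remains. This toy implementation is an illustration only (the paper's own polynomial-time and MCMC algorithms are not reproduced here); the function names and the brute-force overlap search are assumptions for clarity, not the authors' method.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_scs(reads):
    """Greedy shortest common superstring: at each step, merge the
    ordered pair of reads with the maximum overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_pair = -1, None
        for a, b in permutations(reads, 2):
            k = overlap(a, b)
            if k > best_k:
                best_k, best_pair = k, (a, b)
        a, b = best_pair
        reads.remove(a)
        reads.remove(b)
        # Join a and b, dropping the overlapping prefix of b.
        reads.append(a + b[best_k:])
    return reads[0]

# Example: three overlapping reads assemble into one superstring.
print(greedy_scs(["ACGTA", "GTACC", "ACCTT"]))  # ACGTACCTT
```

The greedy heuristic is not guaranteed to find the shortest superstring (the exact problem is NP-hard, as the abstract notes), but it illustrates why overlap structure governs the difficulty of assembly.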