During viral infection, intrahost mutation and recombination can drive substantial evolution, producing a viral population that harbors multiple haplotypes. Reconstructing these haplotypes from short-read sequencing data is known as viral quasispecies assembly, and it can be framed as a multiassembly problem. We consider the de novo version of the problem, in which no reference is available. We present ViQUF, a de novo viral quasispecies assembler that addresses both haplotype assembly and quantification. ViQUF first obtains a draft of the assembly graph from a de Bruijn graph. Next, for each pair of adjacent vertices, it builds a flow network from their paired-end information and solves a min-cost flow over it; together, these solutions yield an approximate paired assembly graph whose edges are labeled with suggested frequency values, which serve as a first frequency estimate. The original haplotypes are then recovered through a greedy path reconstruction guided by a min-cost flow solution in the approximate paired assembly graph. ViQUF outputs the contigs together with their estimated frequencies. Results on real and simulated data show that ViQUF is at least four times faster than previous methods and uses at most half the memory, while matching, and in some cases exceeding, the assembly and frequency-estimation quality of overlap graph-based methods, which are known to be more accurate but slower than de Bruijn graph-based approaches.
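To make the final step more concrete, the following is a minimal sketch of how a flow on a graph can be greedily decomposed into weighted source-to-sink paths, which is the general idea behind recovering haplotypes and their frequencies from a flow solution. It is not ViQUF's implementation: the toy graph, the function names `widest_path` and `greedy_decompose`, and the widest-path-first heuristic are all assumptions made for illustration, and the edge flows are taken as given rather than computed by a min-cost flow solver.

```python
import heapq


def widest_path(flow, source, sink):
    """Find the source-to-sink path whose smallest edge flow is largest.

    `flow` maps each vertex to a dict {successor: remaining flow}.
    Returns (bottleneck, path), or (0, None) if no path with positive flow exists.
    """
    best = {source: float("inf")}  # best[v]: largest bottleneck found to reach v
    parent = {}
    heap = [(-best[source], source)]  # max-heap emulated with negated widths
    while heap:
        neg_width, u = heapq.heappop(heap)
        width = -neg_width
        if width < best.get(u, 0):
            continue  # stale heap entry
        if u == sink:
            break
        for v, f in flow.get(u, {}).items():
            w = min(width, f)
            if w > best.get(v, 0):  # edges with zero remaining flow are skipped
                best[v] = w
                parent[v] = u
                heapq.heappush(heap, (-w, v))
    if sink not in best:
        return 0, None
    path = [sink]
    while path[-1] != source:
        path.append(parent[path[-1]])
    path.reverse()
    return best[sink], path


def greedy_decompose(flow, source, sink):
    """Repeatedly peel off the widest path and subtract its bottleneck."""
    residual = {u: dict(nbrs) for u, nbrs in flow.items()}  # leave the input untouched
    paths = []
    while True:
        width, path = widest_path(residual, source, sink)
        if path is None:
            break
        for u, v in zip(path, path[1:]):
            residual[u][v] -= width
        paths.append((path, width))
    return paths


# Hypothetical toy flow (integer values, flow-conserving at every inner vertex).
example_flow = {
    "s": {"a": 7, "b": 3},
    "a": {"c": 7},
    "b": {"c": 3},
    "c": {"d": 6, "e": 4},
    "d": {"t": 6},
    "e": {"t": 4},
}

paths = greedy_decompose(example_flow, "s", "t")
total = sum(width for _, width in paths)
# Relative abundance of each reconstructed path.
frequencies = [(path, width / total) for path, width in paths]
```

In this sketch each extracted path plays the role of a candidate haplotype and its bottleneck, normalized by the total flow, plays the role of its frequency. Integer flows are used so that every iteration zeroes at least one edge; with floating-point flows a small tolerance would be needed to guarantee termination.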