The drop in genome sequencing costs brought about by next-generation sequencing technologies has greatly increased the number of genomic projects. As a result, there is a growing need for better assembly and assembly validation methods. One promising idea is to use heterogeneous data in assembly projects. Optical mapping (OM) is useful for validating, correcting, and scaffolding genomic assemblies. A single raw OM read describes a long fragment of a DNA molecule, up to 1 Mbp. Raw OM reads from the same genome can be assembled into consensus maps that span an entire chromosome. The assembly process is computationally hard because of the large number of errors in the input data. This work describes a new algorithm and computer program that assemble OM reads without a reference genome. The algorithm explores a binary representation of genome maps. We focused on efficient data structures and algorithms and on scalability on parallel platforms. The algorithm consists of several steps, of which the most important are: (1) conversion of the restriction maps into binary strings, (2) detection of overlaps between restriction maps, (3) determination of the layout of the set of restriction maps, and (4) creation of consensus genomic maps. Our algorithm handles optical mapping data with low error rates but fails on reads with high error rates. We developed a software library, a console application, and a module for the Python language. The approach presented in this paper proved to be faster than a dynamic programming approach and performed well on error-free data. It could be used as a step in \textit{de~novo} assembly pipelines or to detect misassemblies. The software is freely available in a public repository under the GNU LGPL v3 license (https://sourceforge.net/p/binary-genome-maps/code).
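Steps (1) and (2) can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the API of their library. The function names `map_to_bits` and `best_overlap`, the fixed bin width `bin_bp`, and the score (the number of shared occupied bins) are all assumptions made for illustration. The sketch also assumes error-free cut-site positions, which matches the low-error regime in which the abstract says the approach works.

```python
def map_to_bits(cut_sites_bp, length_bp, bin_bp=1000):
    """Encode a restriction map as an integer bit mask.

    Bit i is set when at least one cut site falls into bin i, where each
    bin covers bin_bp base pairs (an assumed, illustrative resolution).
    Returns the mask and the number of bins covering the molecule.
    """
    n_bins = -(-length_bp // bin_bp)  # ceiling division
    bits = 0
    for pos in cut_sites_bp:
        if not 0 <= pos < length_bp:
            raise ValueError(f"cut site {pos} lies outside the molecule")
        bits |= 1 << (pos // bin_bp)
    return bits, n_bins


def best_overlap(a_bits, a_bins, b_bits, min_shared=3):
    """Slide map b along map a and score each offset by shared bins.

    At offset `shift`, bin j of b is compared with bin j + shift of a.
    Returns (shift, shared) for the best offset, or None when no offset
    reaches min_shared shared bins (hypothetical acceptance threshold).
    """
    best = None
    for shift in range(a_bins):
        shared = bin((a_bits >> shift) & b_bits).count("1")
        if shared >= min_shared and (best is None or shared > best[1]):
            best = (shift, shared)
    return best


# Map b starts 4000 bp into map a, so five of its cut sites coincide
# with cut sites of a once it is shifted by four bins.
a_bits, a_bins = map_to_bits([1200, 4800, 6300, 11900, 13100, 17500], 20000)
b_bits, b_bins = map_to_bits([800, 2300, 7900, 9100, 13500, 18200], 21000)
print(best_overlap(a_bits, a_bins, b_bits))  # (4, 5)
```

Each offset costs one shift, one bitwise AND, and one population count, which is why a binary encoding can be cheaper than a dynamic programming alignment. With exact binning, however, a sizing error that moves a cut site across a bin boundary breaks the match, so real data would need an error-tolerant comparison.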