Genome assembly is a fundamental problem in bioinformatics: given a set of overlapping substrings of a genome, the aim is to reconstruct the source genome. The classical approaches to this problem use assembly graphs, such as de Bruijn graphs or overlap graphs, which maintain partial information about such overlaps. For genome assembly algorithms, these graphs present a trade-off between the overlap information stored and scalability. The Hierarchical Overlap Graph (HOG) was therefore proposed to overcome the limitations of both approaches. For a given set $P$ of $n$ strings, the first algorithm to compute the HOG was given by Cazaux and Rivals [IPL20]; it requires $O(||P||+n^2)$ time and superlinear space, where $||P||$ is the cumulative sum of the lengths of the strings in $P$. This was improved by Park et al. [SPIRE20] to $O(||P||\log n)$ time and $O(||P||)$ space using segment trees, and further to $O(||P||\frac{\log n}{\log \log n})$ time in the word RAM model. Both papers left open the problem of computing the HOG in optimal $O(||P||)$ time and space. In this paper, we achieve the desired optimal bounds by presenting a simple algorithm that does not use any complex data structures.
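To make the notion of overlap concrete, the following minimal Python sketch computes, for every ordered pair of distinct strings, the longest proper suffix of one that is also a prefix of the other. This is a naive illustration only and is not the algorithm of this paper; its running time is far from the $O(||P||)$ bound. The function names `longest_overlap` and `all_overlaps` and the sample reads are hypothetical, and the sketch assumes an overlap must be strictly shorter than both strings.

```python
from typing import Dict, List, Tuple


def longest_overlap(u: str, v: str) -> str:
    """Return the longest proper suffix of u that is also a prefix of v.

    Assumption: an overlap is strictly shorter than both strings.
    Returns the empty string when no such overlap exists.
    """
    # Try candidate lengths from longest to shortest and stop at the first match.
    for k in range(min(len(u), len(v)) - 1, 0, -1):
        if u[-k:] == v[:k]:
            return u[-k:]
    return ""


def all_overlaps(strings: List[str]) -> Dict[Tuple[str, str], str]:
    """Naively compute the longest overlap for every ordered pair of distinct strings."""
    return {
        (u, v): longest_overlap(u, v)
        for u in strings
        for v in strings
        if u != v
    }


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGA", "ACGATT"]  # hypothetical sample reads
    for (u, v), ov in all_overlaps(reads).items():
        print(f"{u} -> {v}: {ov!r}")
```

Checking every pair directly already costs at least quadratic time in $n$, which is why avoiding explicit pairwise comparison matters for reaching bounds linear in $||P||$.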