Genome assembly is a fundamental problem in bioinformatics: given a set of overlapping substrings of a genome, the aim is to reconstruct the source genome. Classical approaches use assembly graphs, such as de Bruijn graphs or overlap graphs, which maintain partial information about these overlaps. For genome assembly algorithms, these graphs present a trade-off between the amount of overlap information stored and scalability. The Hierarchical Overlap Graph (HOG) was proposed to overcome the limitations of both approaches. For a given set $P$ of $n$ strings, the first algorithm to compute the HOG was given by Cazaux and Rivals [IPL20], requiring $O(||P||+n^2)$ time and superlinear space, where $||P||$ is the total length of the strings in $P$. This was improved by Park et al. [SPIRE20] to $O(||P||\log n)$ time and $O(||P||)$ space using segment trees, and further to $O(||P||\frac{\log n}{\log \log n})$ time in the word RAM model. Both works posed the open problem of computing the HOG in optimal $O(||P||)$ time and space. In this paper, we achieve these optimal bounds with a simple algorithm that does not use any complex data structures. At its core, our solution improves the classical result [IPL92] for a special case of the All Pairs Suffix Prefix (APSP) problem from $O(||P||+n^2)$ time to optimal $O(||P||)$ time, which may be of independent interest.
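
To make the APSP problem concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of this paper or of the cited works, and it is far slower than the bounds discussed above; it only illustrates what is being computed. The function names `longest_overlap` and `all_pairs_suffix_prefix`, and the example strings, are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def longest_overlap(a: str, b: str) -> int:
    """Return the length of the longest suffix of `a` that is a prefix of `b`."""
    max_len = min(len(a), len(b))
    # Try candidate lengths from longest to shortest and stop at the first match.
    for k in range(max_len, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def all_pairs_suffix_prefix(strings: List[str]) -> Dict[Tuple[int, int], int]:
    """Brute-force APSP: longest suffix-prefix overlap for every ordered pair i != j."""
    result: Dict[Tuple[int, int], int] = {}
    for i, a in enumerate(strings):
        for j, b in enumerate(strings):
            if i != j:
                result[(i, j)] = longest_overlap(a, b)
    return result


# Hypothetical input set P, chosen only for illustration.
P = ["aabaa", "aadbd", "dbdaa"]
overlaps = all_pairs_suffix_prefix(P)
```

Here `overlaps[(i, j)]` is the length of the longest suffix of `P[i]` that is also a prefix of `P[j]`. The sketch compares every ordered pair directly, so its running time grows with both the number of pairs and the string lengths, which is the cost that linear-time approaches avoid.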