Pattern matching on graphs has been widely studied in recent years because of its importance in genomics applications. Unfortunately, even the simplest problem of deciding if a string appears as a subpath of a graph admits a quadratic lower bound under the Orthogonal Vectors Hypothesis (Equi et al., ICALP 2019, SOFSEM 2021). To avoid this bottleneck, research has shifted towards more specific graph classes, e.g., those induced from multiple sequence alignments (MSAs). Consider segmenting $\mathsf{MSA}[1..m,1..n]$ into $b$ blocks $\mathsf{MSA}[1..m,1..j_1]$, $\mathsf{MSA}[1..m,j_1+1..j_2]$, $\ldots$, $\mathsf{MSA}[1..m,j_{b-1}+1..n]$. The distinct strings in the rows of the blocks, after the removal of gap symbols, form the nodes of an elastic founder graph (EFG), where the edges represent the original connections observed in the MSA. An EFG is called indexable if a node label occurs as a prefix of only those paths that start from a node of the same block. Equi et al. (ISAAC 2021) showed that such EFGs support fast pattern matching. They also gave an $O(mn \log m)$-time algorithm that preprocesses the MSA so that an indexable EFG maximizing the number of blocks can then be constructed in $O(n)$ time, and one minimizing the maximum length of a block in $O(n \log\log n)$ time. Using the suffix tree and solving a novel ancestor problem on trees, we improve the preprocessing to $O(mn)$ time and the $O(n \log \log n)$-time EFG construction to $O(n)$ time, thus showing that both types of indexable EFGs can be constructed in time linear in the input size.
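To make the definitions concrete, the following is a minimal Python sketch of how an EFG could be read off a segmented MSA, together with a naive brute-force test of the indexability condition. It is an illustration only and not the suffix-tree-based algorithm of the paper. The function names `build_efg` and `is_indexable`, the gap symbol `-`, and the representation of a node as a pair (block index, label) are assumptions made for this example. The sketch also assumes that every row contains at least one non-gap symbol in every block.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: a node is (block index, gap-free label).
Node = Tuple[int, str]
Edge = Tuple[Node, Node]


def build_efg(msa: List[str], ends: List[int], gap: str = "-") -> Tuple[Set[Node], Set[Edge]]:
    """Build the nodes and edges of an EFG from an MSA and a segmentation.

    `ends` lists the block end columns j_1 < j_2 < ... < j_b = n
    (1-based, inclusive), matching the notation MSA[1..m, j_{k-1}+1..j_k].
    """
    n = len(msa[0])
    if any(len(row) != n for row in msa):
        raise ValueError("all MSA rows must have the same length")
    if not ends or ends[0] < 1 or ends[-1] != n:
        raise ValueError("block ends must start at >= 1 and finish at n")
    if any(a >= b for a, b in zip(ends, ends[1:])):
        raise ValueError("block ends must be strictly increasing")

    nodes: Set[Node] = set()
    edges: Set[Edge] = set()
    for row in msa:
        start, prev = 0, None
        for k, end in enumerate(ends):
            # Label of this row inside block k, with gap symbols removed.
            label = row[start:end].replace(gap, "")
            if not label:
                # Assumption of this sketch: no row is all gaps in a block.
                raise ValueError("a row consists only of gaps inside a block")
            node = (k, label)
            nodes.add(node)
            if prev is not None:
                edges.add((prev, node))  # connection observed in the MSA
            prev, start = node, end
    return nodes, edges


def is_indexable(nodes: Set[Node], edges: Set[Edge]) -> bool:
    """Naive check: each node label may occur only as a prefix of paths
    that start at the beginning of a node in the same block."""
    succ: Dict[Node, List[Node]] = {v: [] for v in nodes}
    for u, v in edges:
        succ[u].append(v)

    def spells(w: Node, offset: int, pattern: str) -> bool:
        # Does some path starting inside w at `offset` begin with `pattern`?
        rest = w[1][offset:]
        if len(pattern) <= len(rest):
            return rest.startswith(pattern)
        if not pattern.startswith(rest):
            return False
        return any(spells(x, 0, pattern[len(rest):]) for x in succ[w])

    for v in nodes:
        for w in nodes:
            for offset in range(len(w[1])):
                allowed = offset == 0 and w[0] == v[0]
                if not allowed and spells(w, offset, v[1]):
                    return False
    return True
```

For instance, calling `build_efg(["AAGG", "AAGC", "ATGG"], [2, 4])` splits the three rows into two blocks of width 2, and `is_indexable` can then be applied to the returned node and edge sets. The brute-force check is meant for small examples only; its running time is far from the linear bounds discussed above.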