We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear time dynamic programming algorithms have been previously devised to optimize segmentations that induce founder blocks that then can be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs repeat-free founder graphs when constructed from a gapless MSA and repeat-free elastic founder graphs when constructed from a general MSA with gaps. We give a linear time algorithm and a parameterized near linear time algorithm to construct a repeat-free founder graph and a repeat-free elastic founder graph, respectively. We derive a tailored succinct index structure to support queries of arbitrary length in the paths of a repeat-free (elastic) founder graph. In addition, we show how to turn a repeat-free (elastic) founder graph into a Wheeler graph in polynomial time. Furthermore, we show that a property such as repeat-freeness is essential for indexability. In particular, we show that unless the Strong Exponential Time Hypothesis (SETH) fails, one cannot build an index on an elastic founder graph in polynomial time to support fast queries.
翻译:我们引入了一个基于最佳分解概念的缩略图, 目的是从多个序列对齐( MSA) 中重建创建者序列。 这种创建者序列的特征是, MISA 的每行都是创建者的重组。 一些线性时间动态编程算法以前设计了优化分割法, 以引导创建者块, 然后可以混入一组创建者序列。 所有的可能的连结序列都可以以创建者图表示。 我们观察了这些图形的关键属性: 如果节点标签( 断点段) 不重复在图形路径中, 这样的图表可以用于高效的字符串匹配 。 当从一个无间断的 MIMS 和重复的无弹性缩略图中建构出时, 我们称之为这些不重复的创建者图表。 我们给出了线性时间算法和近线性线性参数算法, 以构建一个重复的创建者图表和重复的无弹性创建者图表。 我们用一个定制的简洁的索引结构结构, 支持在一个不重复性的时间长度路径上, (弹性) 将一个稳定的创建者显示一个稳定的创建者图表。