带有双向弦锁定的字符串抽样 (String Sampling with Bidirectional String Anchors)

from arxiv, Extended version of an ESA 2021 paper; it includes proofs omitted from the conference version and new results obtained by Michelle Sweering

The minimizers sampling mechanism is a popular mechanism for string sampling introduced independently by Schleimer et al. [SIGMOD 2003] and by Roberts et al. [Bioinf. 2004]. Given two positive integers $w$ and $k$, it selects the lexicographically smallest length-$k$ substring in every fragment of $w$ consecutive length-$k$ substrings (in every sliding window of length $w + k - 1$). Minimizers samples are approximately uniform, locally consistent, and computable in linear time. Two main disadvantages of minimizers sampling mechanisms are: first, they do not have good guarantees on the expected size of their samples for every combination of $w$ and $k$; and, second, indexes that are constructed over their samples do not have good worst-case guarantees for on-line pattern searches. We introduce bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given a positive integer $\ell$, our mechanism selects the lexicographically smallest rotation in every length-$\ell$ fragment (in every sliding window of length $\ell$). We show that bd-anchors samples are also approximately uniform, locally consistent, and computable in linear time. In addition, our experiments using several datasets demonstrate that the bd-anchors sample sizes decrease proportionally to $\ell$; and that these sizes are competitive to or smaller than the minimizers sample sizes using the analogous sampling parameters. We provide theoretical justification for these results by analyzing the expected size of bd-anchors samples. As a negative result, we show that computing a total order $\leq$ on the input alphabet, which minimizes the bd-anchors sample size, is NP-hard. We also show that by using any bd-anchors sample, we can construct, in near-linear time, an index which requires linear (extra) space in the size of the sample and answers on-line pattern searches in near-optimal time.

翻译：最小化采样机制是Schleimer等人[SIGMOD 2003年]和Roberts等人[Bioinf. 2004年]独立推出的弦采样流行机制。考虑到两个正整数,它选择了每块连续长度-k美元每片中最小的长度-k美元子字符串(每个滑动窗口的长度为$+k-1美元)。最小化采样器样本大致是统一的、本地的一致的、在线性时间内可比较的。最小化采样机制的两个主要缺点是:第一,它们对于每种美元和美元组合的样本的预期大小没有良好的保证;第二,在样本中构造的指数并不具有最坏的保证。我们引入双向导弦锚(b-nchors),任何新的弦采样机制。鉴于正正整数的直径直流值(美元),我们的机制选择了每段长度-美元比例最小的基底值采样机制。在每块中进行最小的旋转,它们没有良好的保证它们的规模;在每块的直径的计算中显示我们的直径小的数值,这些直径的基的样品中显示我们使用的直径的数值。