Indexing Weighted Sequences: Neat and Efficient

In a \emph{weighted sequence}, for every position of the sequence and every letter of the alphabet, a probability of occurrence of this letter at this position is specified. Weighted sequences are commonly used to represent imprecise or uncertain data, for example, in molecular biology, where they are known under the name of Position-Weight Matrices. Given a probability threshold $\frac1z$, we say that a string $P$ of length $m$ occurs in a weighted sequence $X$ at position $i$ if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+m-1$ in $X$ is at least $\frac1z$. In this article, we consider an \emph{indexing} variant of the problem, in which we are to preprocess a weighted sequence to answer multiple pattern matching queries. We present an $O(nz)$-time construction of an $O(nz)$-sized index for a weighted sequence of length $n$ over a constant-sized alphabet that answers pattern matching queries in optimal $O(m+Occ)$ time, where $Occ$ is the number of occurrences reported. The cornerstone of our data structure is a novel construction of a family of $\lfloor z \rfloor$ special strings that carries the information about all the strings that occur in the weighted sequence with a sufficient probability. We obtain a weighted index with the same complexities as the most efficient previously known index, by Barton et al. (CPM 2016), but our construction is significantly simpler. The most complex algorithmic tool required in the basic form of our index is the suffix tree, which we use to develop a new, more straightforward index for the so-called property matching problem. We provide an implementation of our data structure. Our construction also allows us to obtain a significant improvement over the complexities of the approximate variant of the weighted index presented by Biswas et al. (EDBT 2016) and an improvement of the space complexity of their general index.
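
To make the occurrence definition concrete, the following is a minimal Python sketch of a naive matcher that checks the threshold condition directly at every position. It is not the index described in the abstract; it only illustrates when a pattern $P$ occurs in a weighted sequence $X$ for a threshold $\frac1z$. The representation of $X$ as a list of dictionaries, the 0-based positions, and the function names `occurrence_probability` and `naive_occurrences` are assumptions made for this illustration. Exact rational arithmetic is used so that the comparison with $\frac1z$ is not affected by floating-point rounding.

```python
from fractions import Fraction


def occurrence_probability(X, P, i):
    """Product of the probabilities of the letters of P at positions i..i+m-1 of X.

    X is a list of dicts mapping letters to probabilities (an assumed
    representation); letters missing from a dict have probability 0.
    Positions are 0-based here.
    """
    prob = Fraction(1)
    for offset, letter in enumerate(P):
        prob *= X[i + offset].get(letter, Fraction(0))
    return prob


def naive_occurrences(X, P, z):
    """Return all 0-based positions i at which P occurs in X with probability >= 1/z.

    This is a direct O(n * m) check of the definition, not an index.
    """
    threshold = Fraction(1) / Fraction(z)
    m = len(P)
    return [
        i
        for i in range(len(X) - m + 1)
        if occurrence_probability(X, P, i) >= threshold
    ]


# A small hypothetical weighted sequence of length 3 over the alphabet {a, b}.
X = [
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(1)},
    {"a": Fraction(3, 4), "b": Fraction(1, 4)},
]

print(naive_occurrences(X, "ab", 4))  # [1]
print(naive_occurrences(X, "aa", 4))  # [0, 1]
```

With $z = 4$, the pattern `ab` occurs only at position 1, where its probability is $1 \cdot \frac14 = \frac14$; at position 0 the letter `b` has probability 0 in the second column, so the product is 0.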
