We present a new variable-length, computation-friendly encoding scheme, named SFDC (Succinct Format with Direct aCcesibility), that supports fast direct access to any element of the compressed sequence and achieves compression ratios often higher than those offered by other solutions in the literature. The SFDC scheme provides a flexible and simple representation that can be tuned towards either practical efficiency or compression ratio, as required. For a text of length $n$ over an alphabet of size $\sigma$ and a fixed parameter $\lambda$, the access time of the proposed encoding is proportional to the length of the character's code-word, plus an expected $\mathcal{O}((F_{\sigma - \lambda + 3} - 3)/F_{\sigma+1})$ overhead, where $F_j$ is the $j$-th number of the Fibonacci sequence. Overall, it uses $N+\mathcal{O}\big(n \left(\lambda - (F_{\sigma+3}-3)/F_{\sigma+1}\right)\big) = N + \mathcal{O}(n)$ bits, where $N$ is the length of the encoded string. Experimental results show that the performance of our scheme is, in some respects, comparable with that of DACs and Wavelet Trees, which are among the most efficient schemes. In addition, our scheme is designed as a \emph{computation-friendly compression} scheme, as it has several features that make it very effective in text processing tasks. For the string matching problem, which we take as a case study, our experiments show that the new scheme is up to 29 times faster than standard string-matching techniques on plain texts.
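To get a feel for the two Fibonacci expressions above, the following minimal sketch evaluates the quantities that appear inside the $\mathcal{O}(\cdot)$ bounds for given $\sigma$ and $\lambda$. It is only an illustration, not part of the SFDC scheme itself: the function names are hypothetical, and the indexing convention $F_1 = F_2 = 1$ is an assumption, since the abstract does not fix one. Because the expressions sit inside asymptotic bounds, the values indicate magnitude only, up to constant factors.

```python
from fractions import Fraction


def fib(j: int) -> int:
    """Return F_j, assuming the convention F_0 = 0, F_1 = F_2 = 1."""
    a, b = 0, 1  # F_0, F_1
    for _ in range(j):
        a, b = b, a + b
    return a


def access_overhead_term(sigma: int, lam: int) -> Fraction:
    """Expression inside the expected access-time bound:
    (F_{sigma - lam + 3} - 3) / F_{sigma + 1}."""
    return Fraction(fib(sigma - lam + 3) - 3, fib(sigma + 1))


def space_overhead_term(sigma: int, lam: int) -> Fraction:
    """Per-character expression inside the space bound:
    lam - (F_{sigma + 3} - 3) / F_{sigma + 1}."""
    return lam - Fraction(fib(sigma + 3) - 3, fib(sigma + 1))


if __name__ == "__main__":
    # Example values chosen for illustration only (byte alphabet, lam = 5).
    sigma, lam = 256, 5
    print(float(access_overhead_term(sigma, lam)))
    print(float(space_overhead_term(sigma, lam)))
```

Since the ratio of consecutive Fibonacci numbers tends to the golden ratio, both expressions settle to constants as $\sigma$ grows with $\lambda$ fixed, which is consistent with the $N + \mathcal{O}(n)$ space bound stated above.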