Towards a Definitive Compressibility Measure for Repetitive Sequences

Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size $z$ of the Lempel--Ziv parse are frequently used to estimate it. The size $b \le z$ of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute and it is not monotonic upon symbol appends. Recently, a more principled measure, the size $\gamma$ of the smallest string \emph{attractor}, was introduced. The measure $\gamma \le b$ lower bounds all the previous relevant ones, yet length-$n$ strings can be represented and efficiently indexed within space $O(\gamma\log\frac{n}{\gamma})$, which also upper bounds most measures. While $\gamma$ is certainly a better measure of repetitiveness than $b$, it is also NP-complete to compute and not monotonic, and it is unknown if one can always represent a string in $o(\gamma\log n)$ space. In this paper, we study an even smaller measure, $\delta \le \gamma$, which can be computed in linear time, is monotonic, and allows encoding every string in $O(\delta\log\frac{n}{\delta})$ space because $z = O(\delta\log\frac{n}{\delta})$. We show that $\delta$ better captures the compressibility of repetitive strings. Concretely, we show that (1) $\delta$ can be strictly smaller than $\gamma$, by up to a logarithmic factor; (2) there are string families needing $\Omega(\delta\log\frac{n}{\delta})$ space to be encoded, so this space is optimal for every $n$ and $\delta$; (3) one can build run-length context-free grammars of size $O(\delta\log\frac{n}{\delta})$, whereas the smallest (non-run-length) grammar can be up to $\Theta(\log n/\log\log n)$ times larger; and (4) within $O(\delta\log\frac{n}{\delta})$ space we can not only...
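To make the relation between the measures concrete, the following is a minimal brute-force sketch that computes $\delta$ and $z$ on small strings. The abstract does not define $\delta$; the sketch assumes the substring-complexity definition $\delta = \max_{k \ge 1} d_k(S)/k$, where $d_k(S)$ is the number of distinct length-$k$ substrings of $S$. It also assumes the Lempel--Ziv variant in which each phrase is either a single new symbol or the longest prefix of the remaining text with an occurrence starting earlier (overlaps allowed). The function names `delta` and `lz_size` are illustrative only, and both routines run in polynomial rather than linear time, so they are not the linear-time computation the abstract refers to.

```python
from fractions import Fraction


def delta(s: str) -> Fraction:
    """Brute-force delta, ASSUMING delta = max_k d_k(s) / k, where d_k(s)
    is the number of distinct substrings of s of length k."""
    n = len(s)
    best = Fraction(0)
    for k in range(1, n + 1):
        # Collect every length-k substring in a set to count distinct ones.
        d_k = len({s[i:i + k] for i in range(n - k + 1)})
        best = max(best, Fraction(d_k, k))
    return best


def lz_size(s: str) -> int:
    """Number of phrases z in a naive Lempel-Ziv parse, ASSUMING each phrase
    is the longest prefix of s[i:] that also starts at some j < i
    (source and target may overlap), or a single symbol if none exists."""
    n, i, z = len(s), 0, 0
    while i < n:
        longest = 0
        for j in range(i):
            length = 0
            # Extend the match between the earlier position j and position i.
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            longest = max(longest, length)
        i += max(longest, 1)  # a fresh symbol forms a phrase of length 1
        z += 1
    return z


if __name__ == "__main__":
    for text in ["abababab", "aaaa", "abcd"]:
        print(text, "delta =", delta(text), "z =", lz_size(text))
```

On the highly repetitive string `abababab` this gives $\delta = 2$ and $z = 3$, consistent with the chain $\delta \le \gamma \le b \le z$ stated above, while on `abcd`, which has no repetition, both values equal 4.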
