一种用于序列数据非参数熵估计的高效最长公共子串快速算法 (A faster algorithm for efficient longest common substring calculation for non-parametric entropy estimation in sequential data)

from arxiv, Package available at https://pypi.org/project/LCSFinder, codebase available at https://github.com/bridget-smart/LCSFinder. It also includes a heuristic variant offering excellent practical performance with slower worst-case complexity, and has been integrated into the ProcessEntropy (https://pypi.org/project/ProcessEntropy/) package

Non-parametric entropy estimation on sequential data is a fundamental tool in signal processing, capturing information flow within or between processes to measure predictability, redundancy, or similarity. Methods based on longest common substrings (LCS) provide a non-parametric estimate of typical set size but are often inefficient, limiting use on real-world data. We introduce LCSFinder, a new algorithm that improves the worst-case performance of LCS calculations from cubic to log-linear time. Although built on standard algorithmic constructs - including sorted suffix arrays and persistent binary search trees - the details require care to provide the matches required for entropy estimation on dynamically growing sequences. We demonstrate that LCSFinder achieves dramatic speedups over existing implementations on real and simulated data, enabling entropy estimation at scales previously infeasible in practical signal processing.

翻译：序列数据的非参数熵估计是信号处理中的基础工具，通过捕捉过程内或过程间的信息流来度量可预测性、冗余度或相似性。基于最长公共子串的方法提供了典型集大小的非参数估计，但通常效率低下，限制了其在实际数据中的应用。我们提出了LCSFinder算法，该算法将最长公共子串计算的最坏情况性能从立方时间提升至对数线性时间。尽管该算法基于标准算法结构——包括排序后缀数组和持久化二叉搜索树——但具体实现需要精细设计，以满足动态增长序列的熵估计所需的匹配要求。我们通过真实数据和模拟数据验证，LCSFinder相比现有实现取得了显著的加速效果，使得在以往实际信号处理中难以实现的规模上进行熵估计成为可能。