Estimating the entropy rate of discrete time series is a challenging problem with important applications in numerous areas, including neuroscience, genomics, image processing and natural language processing. A number of approaches have been developed for this task, typically based either on universal data compression algorithms or on statistical estimators of the underlying process distribution. In this work, we propose a fully Bayesian approach to entropy estimation. Building on the recently introduced Bayesian Context Trees (BCT) framework for modelling discrete time series as variable-memory Markov chains, we show that it is possible to sample directly from the induced posterior on the entropy rate. These samples can be used to estimate the entire posterior distribution, providing much richer information than point estimates. We develop theoretical results for the posterior distribution of the entropy rate, including proofs of consistency and asymptotic normality. The practical utility of the method is illustrated on both simulated and real-world data, where it is found to outperform state-of-the-art alternatives.
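To make the idea of sampling from the posterior on the entropy rate concrete, the following is a minimal sketch under a strong simplifying assumption: the model is a fixed first-order Markov chain with independent symmetric Dirichlet priors on each row of the transition matrix, not the variable-memory BCT model described above. The function names, the prior parameter of 0.5 and the toy sequence are all hypothetical choices made for illustration only.

```python
import math
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from Dirichlet(alphas) via normalized Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def stationary_distribution(P, iters=500):
    """Approximate the stationary distribution of P by power iteration."""
    m = len(P)
    pi = [1.0 / m] * m
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(m)) for j in range(m)]
    return pi


def entropy_rate(P):
    """Entropy rate in bits: H = -sum_i pi_i sum_j P_ij log2 P_ij."""
    m = len(P)
    pi = stationary_distribution(P)
    return -sum(
        pi[i] * P[i][j] * math.log2(P[i][j])
        for i in range(m)
        for j in range(m)
        if P[i][j] > 0
    )


def posterior_entropy_samples(x, m, n_samples=1000, prior=0.5, seed=0):
    """Sample the entropy rate from its posterior under the simplified model.

    x is a sequence of symbols in {0, ..., m-1}. Assumption: first-order
    Markov chain with a symmetric Dirichlet(prior) on each transition row.
    """
    rng = random.Random(seed)
    # Count observed transitions a -> b.
    counts = [[0] * m for _ in range(m)]
    for a, b in zip(x, x[1:]):
        counts[a][b] += 1
    samples = []
    for _ in range(n_samples):
        # Each row's posterior is Dirichlet(counts + prior) by conjugacy.
        P = [sample_dirichlet([c + prior for c in row], rng) for row in counts]
        samples.append(entropy_rate(P))
    return samples


# Toy usage on a short, made-up binary sequence.
x = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0]
samples = sorted(posterior_entropy_samples(x, m=2, n_samples=500))
mean = sum(samples) / len(samples)
lo, hi = samples[int(0.025 * len(samples))], samples[int(0.975 * len(samples))]
print(f"posterior mean: {mean:.3f} bits, 95% interval: [{lo:.3f}, {hi:.3f}]")
```

Because every draw is a full transition matrix, the resulting entropy-rate samples describe a whole posterior distribution, so credible intervals come directly from their empirical quantiles rather than from a single point estimate.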