Efficient and accurate estimation of multivariate empirical probability distributions is fundamental to the calculation of information-theoretic measures such as mutual information and transfer entropy. Common techniques include variations on histogram estimation, which, whilst computationally efficient, are often unable to precisely capture the probability density of samples exhibiting high correlation, kurtosis or fine substructure, especially when sample sizes are small. Adaptive partitions, which adjust heuristically to the sample, can reduce the bias imparted by the geometry of the histogram itself, but these have commonly focused on the location, scale and granularity of the partition, the effects of which are limited for highly correlated distributions. In this paper, I reformulate the differential entropy estimator for the special case of an equiprobable histogram, using a k-d tree to partition the sample space into bins of equal probability mass. By doing so, I expose an implicit rotational orientation parameter, which is conjectured to be suboptimally specified in the typical marginal alignment. I propose that the optimal orientation minimises the variance of the bin volumes, and demonstrate that improved entropy estimates can be obtained by rotationally aligning the partition to the sample distribution accordingly. Such optimal partitions are observed to be more accurate than existing techniques in estimating entropies of correlated bivariate Gaussian distributions with known theoretical values, across varying sample sizes (99% CI).
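The equiprobable k-d tree partition described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it recursively splits the sample at the median along cycling axes until a fixed depth is reached, giving bins of (approximately) equal probability mass, and evaluates the plug-in entropy estimate Ĥ = Σ_j (n_j/n) log(V_j · n/n_j), where n_j and V_j are the count and volume of bin j. Bin volumes are taken from the sample bounding box, and no rotational alignment is applied; the function name and depth parameter are illustrative assumptions.

```python
import numpy as np

def equiprobable_entropy(x, depth):
    """Differential entropy estimate from an equiprobable k-d partition.

    x     : (n, d) array of samples.
    depth : number of median splits, producing 2**depth bins.

    Illustrative sketch only: volumes of the outer bins are bounded by
    the sample range, so tails of unbounded distributions are truncated.
    """
    n, d = x.shape

    def recurse(pts, bounds, level):
        if level == 0:
            # Bin volume from its axis-aligned bounding box.
            vol = np.prod(bounds[:, 1] - bounds[:, 0])
            return [(len(pts), vol)]
        axis = level % d                      # cycle through the axes
        med = np.median(pts[:, axis])
        lo_b, hi_b = bounds.copy(), bounds.copy()
        lo_b[axis, 1] = med                   # left child: upper edge at median
        hi_b[axis, 0] = med                   # right child: lower edge at median
        left = pts[pts[:, axis] <= med]
        right = pts[pts[:, axis] > med]
        return recurse(left, lo_b, level - 1) + recurse(right, hi_b, level - 1)

    bounds = np.stack([x.min(axis=0), x.max(axis=0)], axis=1)
    bins = recurse(x, bounds, depth)
    # H-hat = sum over bins of (n_j/n) * log(V_j * n / n_j).
    return sum((c / n) * np.log(v * n / c) for c, v in bins if c > 0)
```

For a uniform sample on the unit square the true differential entropy is 0, and the estimate lands close to that; for correlated Gaussians, the marginal (axis-aligned) splits used here are exactly the baseline whose orientation the paper argues is suboptimal.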