Estimating the Shannon entropy of a discrete distribution from which we have only observed a small sample is challenging. Estimating other information-theoretic metrics, such as the Kullback-Leibler divergence between two sparsely sampled discrete distributions, is even harder. Existing approaches to these problems have shortcomings: they are biased or heuristic, work only for certain classes of distributions, and/or cannot be applied to all information-theoretic metrics. Here, we propose a fast, semi-analytical estimator for sparsely sampled distributions that is efficient, precise, and general. Its derivation is grounded in probabilistic considerations and uses a hierarchical Bayesian approach to extract as much information as possible from the few observations available. Our approach provides estimates of the Shannon entropy with precision at least comparable to the state of the art, and most often better. It can also be used to obtain accurate estimates of any other information-theoretic metric, including the notoriously challenging Kullback-Leibler divergence. Here again, our approach performs consistently better than existing estimators.
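To make the challenge concrete, the sketch below (not the estimator proposed here, just an illustration of the standard baseline) shows the well-known negative bias of the naive maximum-likelihood, or "plug-in", entropy estimator when the sample is small relative to the support. The distribution, sample sizes, and random seed are arbitrary choices for the demonstration.

```python
import numpy as np

def plugin_entropy(counts):
    """Maximum-likelihood ("plug-in") Shannon entropy estimate, in nats."""
    p = counts / counts.sum()
    p = p[p > 0]  # 0 * log(0) is taken to be 0
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
K = 1000                       # support size
true_p = np.full(K, 1.0 / K)   # uniform distribution over K symbols
true_H = np.log(K)             # exact entropy: ln(1000) ≈ 6.91 nats

for n in (100, 1000, 10000):
    counts = rng.multinomial(n, true_p)
    print(f"n = {n:5d}   plug-in H = {plugin_entropy(counts):.2f}"
          f"   (true H = {true_H:.2f})")
```

With n = 100 draws, at most 100 distinct symbols can appear, so the plug-in estimate cannot exceed ln(100) ≈ 4.6 nats and severely underestimates the true entropy of ≈ 6.91 nats; the bias shrinks only as the sample grows well past the support size. This is the sparse-sampling regime that the estimator described above is designed to handle.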