Information-theoretic quantities, such as conditional entropy and mutual information, are critical data summaries for quantifying uncertainty. Widely used approaches for computing such quantities rely on nearest-neighbor methods, which exhibit both strong performance and theoretical guarantees in certain simple scenarios. However, existing approaches fail in high-dimensional settings and when different features are measured on different scales. We propose decision forest-based adaptive nearest-neighbor estimators and show that they effectively estimate posterior probabilities, conditional entropies, and mutual information even in these settings. We provide an extensive study of their efficacy for classification and posterior probability estimation, and prove that, under certain assumptions, certain forest-based approaches are consistent estimators of the true posteriors and of the derived information-theoretic quantities. In a real-world connectome application, we quantify the uncertainty about neuron type given various cellular features in the Drosophila larva mushroom body, a key challenge for modern neuroscience.
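To make the link between posterior estimates and the information-theoretic quantities concrete, the following minimal sketch computes plug-in estimates of conditional entropy and mutual information from per-sample posterior vectors. It is an illustration under stated assumptions, not the estimator studied here: the function names are hypothetical, and `posteriors` stands in for whatever a fitted forest would return for each sample (for example, per-class probabilities). The sketch uses the standard identities H(Y|X) ≈ (1/n) Σ_i H(p(· | x_i)) and I(X; Y) = H(Y) − H(Y|X), with entropies in nats.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one discrete distribution."""
    # Terms with p == 0 contribute nothing (the limit of p*log(p) is 0).
    return -sum(p * math.log(p) for p in probs if p > 0)


def conditional_entropy(posteriors):
    """Plug-in estimate of H(Y|X): mean entropy of the per-sample posteriors.

    `posteriors` is a list of probability vectors, one per sample.
    Assumption: these come from some posterior estimator such as a forest.
    """
    return sum(entropy(p) for p in posteriors) / len(posteriors)


def mutual_information(posteriors, labels, n_classes):
    """Plug-in estimate of I(X; Y) = H(Y) - H(Y|X).

    H(Y) is estimated from the empirical class frequencies in `labels`,
    which must be integers in range(n_classes).
    """
    counts = [0] * n_classes
    for y in labels:
        counts[y] += 1
    prior = [c / len(labels) for c in counts]
    return entropy(prior) - conditional_entropy(posteriors)


# Toy data (hypothetical): two samples classified with certainty,
# two samples where the posterior is maximally uncertain.
posteriors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5]]
labels = [0, 1, 0, 1]

h_cond = conditional_entropy(posteriors)
mi = mutual_information(posteriors, labels, n_classes=2)
print(f"H(Y|X) estimate: {h_cond:.4f} nats")
print(f"I(X;Y) estimate: {mi:.4f} nats")
```

In this toy case half of the samples carry no residual uncertainty and half carry the full log 2 nats, so both estimates equal half of log 2. The quality of such plug-in estimates depends entirely on how well the posteriors are estimated, which is the role the forest-based estimators play.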