Estimating mutual information (MI) between two continuous random variables $X$ and $Y$ allows one to capture non-linear dependencies between them, non-parametrically. As such, MI estimation lies at the core of many data science applications. Yet, robustly estimating MI for high-dimensional $X$ and $Y$ is still an open research question. In this paper, we formulate this problem through the lens of manifold learning. That is, we leverage the common assumption that the information of $X$ and $Y$ is captured by a low-dimensional manifold embedded in the observed high-dimensional space, and transfer this assumption to MI estimation. As an extension of state-of-the-art $k$NN estimators, we propose to determine the $k$-nearest neighbors via geodesic distances on this manifold rather than in the ambient space, which allows us to estimate MI even in the high-dimensional setting. An empirical evaluation of our method, G-KSG, against the state-of-the-art shows that it yields accurate estimates of MI on classical benchmark and manifold tasks, even for high-dimensional datasets, which none of the existing methods can handle.
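To make the core idea concrete, the following is a minimal illustrative sketch, not the authors' G-KSG implementation: a KSG-style estimator in which the nearest-neighbor distances are approximated Isomap-style, i.e. by shortest-path (geodesic) distances over a Euclidean $k$NN graph built on the samples. The function name `geodesic_ksg_mi` and the parameters `k` and `graph_k` are hypothetical choices for this sketch.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.special import digamma
from sklearn.neighbors import kneighbors_graph


def geodesic_ksg_mi(x, y, k=5, graph_k=10):
    """Illustrative geodesic-kNN MI estimate (KSG-style), not the paper's exact G-KSG.

    Geodesic distances are approximated Isomap-style: a graph_k-NN graph is
    built on the samples and shortest-path distances on that graph stand in
    for distances along the underlying manifold.
    """
    n = x.shape[0]
    z = np.hstack([x, y])

    # Approximate geodesic distances in the joint space via shortest paths
    # over a Euclidean kNN graph.
    d_z = shortest_path(kneighbors_graph(z, graph_k, mode="distance"),
                        method="D", directed=False)
    # Same construction for the marginal spaces.
    d_x = shortest_path(kneighbors_graph(x, graph_k, mode="distance"),
                        method="D", directed=False)
    d_y = shortest_path(kneighbors_graph(y, graph_k, mode="distance"),
                        method="D", directed=False)

    mi = 0.0
    for i in range(n):
        # Geodesic distance to the k-th nearest neighbor in the joint space
        # (index 0 of the sorted row is the point itself).
        eps = np.sort(d_z[i])[k]
        # KSG rule: count marginal neighbors strictly within that radius,
        # excluding the point itself.
        n_x = np.sum(d_x[i] < eps) - 1
        n_y = np.sum(d_y[i] < eps) - 1
        mi += digamma(k) + digamma(n) - digamma(n_x + 1) - digamma(n_y + 1)
    return max(mi / n, 0.0)
```

The last line applies the standard KSG formula $\hat{I}(X;Y) = \psi(k) + \psi(N) - \langle \psi(n_x+1) + \psi(n_y+1) \rangle$; only the metric used to find neighbors differs from the classical estimator. A practical implementation would also need to handle disconnected neighborhood graphs, which this sketch ignores.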